112 changes: 54 additions & 58 deletions README.md
@@ -1,86 +1,82 @@
# Backend Engineering Challenge

Welcome to our Engineering Challenge repository 🖖

If you found this repository it probably means that you are participating in our recruitment process. Thank you for your time and energy. If that's not the case, please take a look at our [openings](https://unbabel.com/careers/) and apply!

Please fork this repo before you start working on the challenge, read it carefully, and take your time to think about the solution. We will evaluate the code on your fork.

This is an opportunity for us both to work together and get to know each other in a more technical way. If you have any questions, please open an issue and we'll reach out to help.

Good luck!

## Challenge Scenario

At Unbabel we deal with a lot of translation data. One of the metrics we use for our clients' SLAs is the delivery time of a translation.

In the context of this problem, and to keep things simple, our translation flow is going to be modeled as only one event.

### *translation_delivered*

```json
{
  "timestamp": "2018-12-26 18:12:19.903159",
  "translation_id": "5aa5b2f39f7254a75aa4",
  "source_language": "en",
  "target_language": "fr",
  "client_name": "airliberty",
  "event_name": "translation_delivered",
  "duration": 20,
  "nr_words": 100
}
```

## Challenge Objective

Your mission is to build a simple command line application that parses a stream of events and produces an aggregated output. In this case, we're interested in calculating, for every minute, a moving average of the translation delivery time for the last X minutes.

If we want to count, for each minute, the moving average delivery time of all translations for the past 10 minutes, we would call your application like this (feel free to name it anything you like!):

```
unbabel_cli --input_file events.json --window_size 10
```

The input file format would be something like:

```
{"timestamp": "2018-12-26 18:11:08.509654","translation_id": "5aa5b2f39f7254a75aa5","source_language": "en","target_language": "fr","client_name": "airliberty","event_name": "translation_delivered","nr_words": 30, "duration": 20}
{"timestamp": "2018-12-26 18:15:19.903159","translation_id": "5aa5b2f39f7254a75aa4","source_language": "en","target_language": "fr","client_name": "airliberty","event_name": "translation_delivered","nr_words": 30, "duration": 31}
{"timestamp": "2018-12-26 18:23:19.903159","translation_id": "5aa5b2f39f7254a75bb3","source_language": "en","target_language": "fr","client_name": "taxi-eats","event_name": "translation_delivered","nr_words": 100, "duration": 54}
```

Assume that the lines in the input are ordered by the `timestamp` key, from lower (oldest) to higher values, just like in the example input above.

The output file would be something in the following format:

```
{"date": "2018-12-26 18:11:00", "average_delivery_time": 0}
{"date": "2018-12-26 18:12:00", "average_delivery_time": 20}
{"date": "2018-12-26 18:13:00", "average_delivery_time": 20}
{"date": "2018-12-26 18:14:00", "average_delivery_time": 20}
{"date": "2018-12-26 18:15:00", "average_delivery_time": 20}
{"date": "2018-12-26 18:16:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:17:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:18:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:19:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:20:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:21:00", "average_delivery_time": 25.5}
{"date": "2018-12-26 18:22:00", "average_delivery_time": 31}
{"date": "2018-12-26 18:23:00", "average_delivery_time": 31}
{"date": "2018-12-26 18:24:00", "average_delivery_time": 42.5}
```

For example, at 18:16 the 10-minute window contains the events delivered at 18:11:08 (duration 20) and 18:15:19 (duration 31), so the average is (20 + 31) / 2 = 25.5.

#### Notes

Before jumping right into implementation we advise you to think about the solution first. We will evaluate not only whether your solution works, but also the following aspects:

+ Simple and easy to read code. Remember that [simple is not easy](https://www.infoq.com/presentations/Simple-Made-Easy)
+ Comment your code. The easier it is to understand the complex parts, the faster and more positive the feedback will be
+ Consider the optimizations you can do, given the order of the input lines
+ Include a README.md that briefly describes how to build and run your code, as well as how to **test it**
+ Be consistent in your code.

Feel free to include in your solution some of your considerations while doing this challenge. We want you to solve this challenge in the language you feel most comfortable with. Our machines run Python (3.7.x or higher) or Go (1.16.x or higher). If you are thinking of using any other programming language, please reach out to us first 🙏.

Also, if you have any problem, please **open an issue**.

Good luck and may the force be with you

# Unbabel Backend Engineering Challenge - Sequential Processing Approach

This project implements a sequential processing approach to calculating the moving average of translation delivery times. It processes events in chronological order using a sliding window and is designed for memory efficiency and simplicity.

## Features

- **Sequential Processing**: Processes events efficiently in chronological order.
- **Memory Efficiency**: Uses a deque to manage the sliding window.
- **Error Handling**: Handles file-operation and JSON-parsing errors robustly.
- **Logging**: Provides detailed logging for tracking execution and debugging.

## Requirements

- Python 3.7 or higher
- pandas (listed in `requirements.txt`; not imported by `unbabel_cli.py` itself)

## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/komalparakh05/backend-engineering-challenge.git
   cd backend-engineering-challenge
   ```

2. Create a virtual environment and install dependencies:

   ```bash
   python -m venv venv
   .\venv\Scripts\activate     # On Windows
   # source venv/bin/activate  # On macOS/Linux
   pip install -r requirements.txt
   ```

## Usage

Run the script with the following command:

```bash
python unbabel_cli.py --input_file <path_to_input_file> --window_size <window_size_in_minutes>
```

The input file should contain one `translation_delivered` event per line, in the JSON format shown in the challenge description above.

Example:

```bash
python unbabel_cli.py --input_file events.json --window_size 10
```

## Testing

To test the code, write unit tests with a framework such as `unittest` or `pytest`, covering key functionality such as parsing input, calculating averages, and handling errors. To set up `pytest`:

Install pytest:

```bash
pip install pytest
```

Run the tests with:

```bash
pytest
```

**Refer to `tests/test_your_script.py` for an example of how the moving-average calculation is tested with `pytest`.**
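
If `pytest` cannot import `unbabel_cli` (the test imports it from the repository root), one option is to invoke it through the interpreter, which also puts the current directory on `sys.path`:

```bash
python -m pytest tests/
```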

## Design Considerations

- **Order of Processing**: Designed to efficiently handle input that is already ordered by timestamp.
- **Memory Optimization**: Uses a deque to minimize memory usage while managing the sliding window (sketched below).
- **Error Resilience**: Incorporates error handling to ensure robust execution.
- **Transparency**: Provides logging to help understand the processing flow and diagnose issues.
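
The following is a simplified, self-contained sketch of the deque-based sliding window described above (hypothetical helper and sample data, not the actual `unbabel_cli.py` implementation shown later in this diff):

```python
from collections import deque
from datetime import datetime, timedelta

def window_average(events, window_size, at_minute):
    """Average the duration of events delivered in the window_size minutes
    before at_minute (a minute-aligned datetime); returns 0 for an empty window."""
    window = deque()
    for ts, duration in events:  # events are assumed ordered by timestamp
        if ts < at_minute:       # events at or after the output minute don't count yet
            window.append((ts, duration))
    # Evict events that fall outside the window, oldest first.
    while window and (at_minute - window[0][0]) >= timedelta(minutes=window_size):
        window.popleft()
    return sum(d for _, d in window) / len(window) if window else 0

# First two events from the challenge sample, evaluated at 18:16 with a 10-minute window:
sample = [
    (datetime(2018, 12, 26, 18, 11, 8, 509654), 20),
    (datetime(2018, 12, 26, 18, 15, 19, 903159), 31),
]
print(window_average(sample, 10, datetime(2018, 12, 26, 18, 16)))  # prints 25.5
```

In `unbabel_cli.py` itself the deque persists across output minutes, so each event is appended once and evicted at most once rather than the window being rebuilt for every minute.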

## Alternative Approach

An alternative solution that focuses on streaming processing and real-time data handling is available in the `streaming_approach` branch. To explore it, check out that branch:

```bash
git checkout streaming_approach
```

This alternative showcases additional data engineering skills and provides a more comprehensive approach to the challenge.
1 change: 1 addition & 0 deletions requirements.txt
@@ -0,0 +1 @@
pandas
49 changes: 49 additions & 0 deletions tests/test_your_script.py
@@ -0,0 +1,49 @@
import pytest
from unittest.mock import mock_open, patch
from unbabel_cli import calculate_moving_average

def test_calculate_moving_average(capsys):
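    # capsys is pytest's built-in fixture that captures text written to stdout/stderr.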
# Sample input data
input_data = """{"timestamp": "2018-12-26 18:11:08.509654", "duration": 20}
{"timestamp": "2018-12-26 18:15:19.903159", "duration": 31}
{"timestamp": "2018-12-26 18:23:19.903159", "duration": 54}
"""

# Expected output based on the actual function behavior
expected_output = [
'{"date": "2018-12-26 18:11:00", "average_delivery_time": 0}',
'{"date": "2018-12-26 18:12:00", "average_delivery_time": 20.0}',
'{"date": "2018-12-26 18:13:00", "average_delivery_time": 20.0}',
'{"date": "2018-12-26 18:14:00", "average_delivery_time": 20.0}',
'{"date": "2018-12-26 18:15:00", "average_delivery_time": 20.0}',
'{"date": "2018-12-26 18:16:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:17:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:18:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:19:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:20:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:21:00", "average_delivery_time": 25.5}',
'{"date": "2018-12-26 18:22:00", "average_delivery_time": 31.0}',
'{"date": "2018-12-26 18:23:00", "average_delivery_time": 31.0}',
'{"date": "2018-12-26 18:24:00", "average_delivery_time": 42.5}',
'{"date": "2018-12-26 18:25:00", "average_delivery_time": 42.5}',
'{"date": "2018-12-26 18:26:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:27:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:28:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:29:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:30:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:31:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:32:00", "average_delivery_time": 54.0}',
'{"date": "2018-12-26 18:33:00", "average_delivery_time": 54.0}',
]

# Use mock_open to simulate file reading
with patch("builtins.open", mock_open(read_data=input_data)):
# Call the function with a mock file path and window size
calculate_moving_average("mock_file_path", 10)

# Capture the output
captured = capsys.readouterr()

# Split the output into lines and compare with expected output
actual_output = [line.strip() for line in captured.out.splitlines()]
assert actual_output == expected_output
106 changes: 106 additions & 0 deletions unbabel_cli.py
@@ -0,0 +1,106 @@
#!/usr/bin/env python3

import argparse
import json
from datetime import datetime, timedelta
from collections import deque
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)

def parse_arguments():
"""Parse command line arguments for input file and window size."""
parser = argparse.ArgumentParser(description='Calculate moving average of translation delivery times.')
parser.add_argument('--input_file', required=True, help='Path to the input JSON file')
parser.add_argument('--window_size', type=int, required=True, help='Window size in minutes')
return parser.parse_args()

def parse_timestamp(timestamp_str):
"""Convert timestamp string to a datetime object."""
try:
return datetime.strptime(timestamp_str, "%Y-%m-%d %H:%M:%S.%f")
except ValueError as e:
logger.error(f"Timestamp parsing error: {e}")
raise

def calculate_moving_average(input_file, window_size):
"""Calculate and print the moving average of delivery times."""
window = deque() # Initialize a deque to store events within the window
current_minute = None # Track the current minute being processed
last_event_minute = None # Track the last event's minute

try:
with open(input_file, 'r') as f:
for line_num, line in enumerate(f, 1):
try:
event = json.loads(line) # Parse each line as a JSON event
except json.JSONDecodeError as e:
logger.warning(f"JSON parsing error on line {line_num}: {e}")
continue

timestamp = parse_timestamp(event['timestamp'])
minute = timestamp.replace(second=0, microsecond=0) # Round down to the nearest minute

if current_minute is None:
current_minute = minute

last_event_minute = minute

# Process each minute up to the current event's minute
while current_minute <= minute:
# Remove events that are outside the window
while window and (current_minute - parse_timestamp(window[0]['timestamp'])).total_seconds() / 60 >= window_size:
window.popleft()

# Calculate the average delivery time for the current minute
if window:
avg_delivery_time = sum(e['duration'] for e in window) / len(window)
else:
avg_delivery_time = 0

# Output the result for the current minute
print(json.dumps({
"date": current_minute.strftime("%Y-%m-%d %H:%M:%S"),
"average_delivery_time": round(avg_delivery_time, 1)
}))

current_minute += timedelta(minutes=1)

# Add the current event to the window
window.append(event)

# Continue processing for the remaining minutes after the last event
# This ensures we calculate the moving average for the entire window size after the last event
while current_minute <= last_event_minute + timedelta(minutes=window_size):
# Remove events that are outside the window
while window and (current_minute - parse_timestamp(window[0]['timestamp'])).total_seconds() / 60 >= window_size:
window.popleft()

# Calculate the average delivery time for the current minute
if window:
avg_delivery_time = sum(e['duration'] for e in window) / len(window)
else:
avg_delivery_time = 0

# Output the result for the current minute
print(json.dumps({
"date": current_minute.strftime("%Y-%m-%d %H:%M:%S"),
"average_delivery_time": round(avg_delivery_time, 1)
}))

current_minute += timedelta(minutes=1)

except FileNotFoundError:
logger.error(f"File not found: {input_file}")
except Exception as e:
logger.error(f"An unexpected error occurred: {e}")

def main():
"""Main function to parse arguments and calculate moving averages."""
args = parse_arguments()
calculate_moving_average(args.input_file, args.window_size)

if __name__ == "__main__":
main()
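
For reference, assuming an `events.json` containing the three sample events from the challenge, `python unbabel_cli.py --input_file events.json --window_size 10` prints one JSON line per minute to standard output, beginning:

```
{"date": "2018-12-26 18:11:00", "average_delivery_time": 0}
{"date": "2018-12-26 18:12:00", "average_delivery_time": 20.0}
{"date": "2018-12-26 18:13:00", "average_delivery_time": 20.0}
{"date": "2018-12-26 18:14:00", "average_delivery_time": 20.0}
{"date": "2018-12-26 18:15:00", "average_delivery_time": 20.0}
{"date": "2018-12-26 18:16:00", "average_delivery_time": 25.5}
```

and continuing through `18:33:00`; the full expected sequence is listed in `tests/test_your_script.py`.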