TASK2_PROMPT.md (48 additions, 0 deletions)
# Python Screening Task 2: Write a Prompt for an AI Debugging Assistant

## Prompt

You are an AI assistant helping a student debug their Python code.
Your role is to carefully analyze the provided code and guide the student by:
- Identifying possible errors, mistakes, or inefficiencies.
- Asking leading questions that encourage the student to think critically.
- Giving hints or suggestions without directly providing the full solution.
- Maintaining a supportive, educational, and beginner-friendly tone.

When responding:
- Start by summarizing what the code is attempting to do.
- Point out suspicious lines, logic, or syntax that may cause issues.
- Suggest ways the student can test or debug those parts.
- If the bug relates to common Python pitfalls (e.g., indentation, variable scope, type errors), explain the concept with a simple example (but not the exact fix); a sketch of one such concept example follows this list.
- Encourage the student to try fixing it themselves and share their updated attempt.
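
For illustration, a concept example of this kind might look like the snippet below for a variable-scope pitfall. The code and the mentor-style hint are hypothetical, written to show the "explain the concept, not the exact fix" rule rather than any particular student's program.

```python
# Hypothetical concept example for a variable-scope pitfall: assigning to a
# name inside a function makes it local, so reading it first fails.

def add_score(points):
    total = total + points   # raises UnboundLocalError when called:
    return total             # 'total' is local here and read before assignment

total = 0
# add_score(5)  # uncommenting this line triggers the error described above

# A mentor-style hint (not the fix):
# "Inside a function, assigning to a name makes it local. What does Python
#  see when it reaches `total + points` on that first line?"
```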

---

## Reasoning

### Why I worded it this way
The wording ensures the AI acts as a **mentor**, guiding the student rather than handing over direct answers. It emphasizes constructive feedback and guiding questions, which mirrors how a human teacher would help.

### How it avoids giving away the solution
Instead of saying *“replace X with Y”*, the prompt instructs the AI to only highlight problem areas and suggest debugging approaches (like printing variables, checking loops, or revisiting syntax).

### How it encourages helpful, student-friendly feedback
The prompt explicitly asks the AI to:
- Use a supportive tone,
- Explain concepts simply, and
- Encourage student engagement (e.g., “try printing this variable and see what happens”).

Together, these instructions make the interaction motivating rather than discouraging.

---

## Additional Reasoning Questions (Required by Task)

1. **Tone and Style**
- The AI should be friendly, patient, and clear. It should avoid technical jargon unless explained with simple examples.

2. **Balancing bug identification vs. guidance**
- The AI should identify likely problem spots but stop short of providing fixes. Instead, it should suggest tests or alternative approaches for the student to try.

3. **Adapting for beginner vs. advanced learners**
- For beginners: give more detailed explanations of concepts (e.g., “In Python, indentation matters because…”).
- For advanced learners: be more concise and focus on logic or optimization hints rather than syntax basics.
TASK3_RESEARCH.md (31 additions, 0 deletions)
# TASK 3 — Research plan: Evaluating open-source models for student competence analysis

## Research plan (2 paragraphs)

**Paragraph 1 — Goal & scope.**
My goal is to evaluate one or more freely available models for the specific task of *high-level student competence analysis* in Python programming. I will focus on models that can (a) analyze a short Python function or snippet, (b) identify conceptual gaps or misconceptions (for example misuse of loops, off-by-one logic, incorrect use of types), and (c) generate prompts or diagnostic questions that help a student reflect and learn — without giving away exact fixes. To keep the evaluation concrete, I will use a small benchmark of 10 student-written Python snippets covering typical beginner mistakes (variable scoping, loop logic, off-by-one, incorrect conditionals, misuse of built-in functions). Evaluation will assess whether the model: identifies likely misconceptions, suggests lines/areas to inspect, and proposes scaffolded prompts that encourage reasoning rather than direct answers.
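
To make the benchmark format concrete, one annotated item might look like the sketch below; the snippet, the label names, and the dictionary layout are illustrative assumptions rather than the final dataset format.

```python
# Hypothetical benchmark item: a student snippet paired with the conceptual
# labels the model is expected to surface (an off-by-one slicing case).
benchmark_item = {
    "id": "ex03_off_by_one",
    "student_code": (
        "def last_n(items, n):\n"
        "    # student intends to return the last n elements\n"
        "    return items[len(items) - n - 1:]\n"
    ),
    "expected_concepts": ["off-by-one", "slice indexing"],
    "notes": "Slice start is shifted by one, so n + 1 elements are returned.",
}
```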

**Paragraph 2 — Approach & plan.**
I will evaluate one open-source model from the family of medium-size LLMs (for example an openly licensed LLaMA derivative or similar model available via Hugging Face). For this short research plan I will: (1) prepare the benchmark of 10 Python examples and an expected checklist of conceptual issues for each example; (2) prompt the model with the same structured instruction (see NOTES) and collect its responses; (3) measure suitability using a small rubric: (A) *Detection accuracy* — whether the model mentions the correct conceptual issue(s) (scored 0/1 each), (B) *Helpfulness of hints* — how well hints point students to inspect code (0–2), and (C) *No-direct-answer* — whether the reply avoids giving the full solution (true/false). I will summarize results in a small table and provide qualitative examples of good and bad model outputs. If time permits, I will also test a lighter-weight programming-specific model and compare its trade-offs in response quality vs. latency and interpretability.
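
As a minimal sketch of how the rubric could be recorded and aggregated, assuming one record of scores per benchmark example (the field names and the aggregation below are my own illustrative choices):

```python
# Minimal rubric-aggregation sketch (field names are illustrative, not fixed).
# Each record holds the three rubric scores for one benchmark example:
#   detection    -> 0/1   (did the model name the right conceptual issue?)
#   hint_quality -> 0-2   (how well do hints point at code to inspect?)
#   revealed_fix -> bool  (did the reply give away the full solution?)

scores = [
    {"id": "ex01", "detection": 1, "hint_quality": 2, "revealed_fix": False},
    {"id": "ex02", "detection": 0, "hint_quality": 1, "revealed_fix": False},
    {"id": "ex03", "detection": 1, "hint_quality": 1, "revealed_fix": True},
]

n = len(scores)
summary = {
    "detection_accuracy": sum(s["detection"] for s in scores) / n,
    "mean_hint_quality": sum(s["hint_quality"] for s in scores) / n,
    "no_direct_answer_rate": sum(not s["revealed_fix"] for s in scores) / n,
}
print(summary)  # aggregated numbers for the results table
```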

---

## Why this model & criteria

- **Why this model:** I choose a moderately sized open model (free to use, low inference cost) because it strikes a balance: expressive enough to reason about code, yet small enough to run locally or on low-cost inference hardware. The goal is not to reach state-of-the-art correctness but to evaluate whether off-the-shelf open models are *suitable* as competence-analysis assistants with minimal fine-tuning.
- **Criteria for suitability:** (1) *Accuracy* — identifies the conceptual issue; (2) *Pedagogical tone* — provides scaffolded prompts that guide thinking (mentor-like); (3) *Non-revealing* — does not provide a full solution; (4) *Explainability* — gives short rationale for its suggestion; (5) *Resource trade-offs* — latency and inference cost for potential classroom use.

---

## How I will test / validate

1. **Dataset:** 10 short Python snippets with known conceptual issues (each annotated with expected conceptual labels).
2. **Prompting:** Use a single, repeatable prompt template that asks the model to (a) identify likely conceptual mistake(s), (b) suggest places in the code to inspect, (c) provide 2 scaffolded questions for the student, and (d) explicitly avoid giving the fixed code. (This prompt will be included in the repo as `TASK3_PROMPT_TEMPLATE.md` if helpful; a draft is sketched after this list.)
3. **Metrics & analysis:** For each example, compute boolean detection accuracy, score hint helpfulness (0–2), and flag whether the model revealed the full fix. Aggregate the results and present qualitative examples. Discuss limitations, failure modes, and next steps (fine-tuning, additional prompt engineering, or using program-analysis tools alongside the model).
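
A draft of the repeatable prompt template from step 2 might look like the sketch below; the exact wording and the `{snippet}` placeholder are working assumptions to be refined in `TASK3_PROMPT_TEMPLATE.md`.

```python
# Draft prompt template (wording is a working assumption, to be finalized
# in TASK3_PROMPT_TEMPLATE.md).
PROMPT_TEMPLATE = """\
You are reviewing a short Python snippet written by a student.

Student code:
{snippet}

Respond with:
1. The conceptual mistake(s) you suspect (name the concept, not the fix).
2. The lines or expressions the student should inspect, and why.
3. Two scaffolded questions that guide the student to reason it out.
Do NOT provide corrected code or the exact fix.
"""

# Usage sketch: fill the template for one benchmark item before sending it
# to whichever open model is under evaluation.
example_prompt = PROMPT_TEMPLATE.format(snippet="def last_n(items, n): ...")
```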

---

## References & notes (short)
- I will use freely available model weights from Hugging Face or a permissively licensed LLM repository (the exact model name will be documented in the README).
- I will include the prompt template, dataset (inlined small examples), and evaluation rubric in the repo so reviewers can reproduce results.
