gen-yang/oam-live-coding-engineer

# OAM Pipeline — Live Coding Exercise

This repo contains a small data pipeline for the Oil & Gas Asset Monitoring (OAM) system we discussed in the design exercise.

The pipeline reads satellite observations of floating-roof oil tanks, computes volumes from fill-level measurements, and produces a gap-filled daily timeseries for a client dashboard.
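The two core computations can be sketched roughly as follows. Note that the tank geometry, column names, and forward-fill strategy below are illustrative assumptions, not the repo's actual schema; see `pipeline/ingest.py` and `data/` for the real thing.

```python
import numpy as np
import pandas as pd

def fill_to_volume(fill_fraction: float, diameter_m: float, height_m: float) -> float:
    """Oil volume (m^3) in a vertical cylindrical tank at a given fill fraction."""
    radius = diameter_m / 2.0
    return np.pi * radius ** 2 * (fill_fraction * height_m)

# Gap-fill sparse satellite passes into a daily series by carrying the last
# observed volume forward until the next pass.
obs = pd.DataFrame(
    {
        "observed_at": pd.to_datetime(["2024-01-01", "2024-01-04"]),
        "volume_m3": [fill_to_volume(0.5, 10.0, 12.0), fill_to_volume(0.8, 10.0, 12.0)],
    }
).set_index("observed_at")

# Resample to one row per day; intermediate days have no observation,
# so forward-fill propagates the last known volume.
daily = obs.resample("D").mean().ffill()
```

Forward-fill is only one plausible gap-filling strategy; the actual pipeline may interpolate differently.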

**Time budget:** ~20 minutes total.


## Setup

```shell
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

Run the pipeline and the tests to check everything works:

```shell
python pipeline/ingest.py
pytest tests/ -v
```

## Repo layout

```
data/
  tanks.csv            5 monitored tanks with dimensions
  observations.csv     50 satellite observations (fill-level measurements)
pipeline/
  ingest.py            Pipeline logic  ← focus here
  models.py            Dataclasses (Tank, Observation) — reference only
tests/
  test_ingest.py       Test suite
```

## Part 1 — Code Review (~8 min)

Read through `pipeline/ingest.py`. This was written as a quick prototype — it works on the current data, but we'd like to harden it before deploying to production.

As you read, think about:

- **Robustness** — what assumptions does the code make that might break?
- **Scale** — we track 5 tanks today, but want to grow to 300 sites. Any concerns?
- **Observability** — would you be comfortable operating this in production?

Talk through your thinking as you go.


## Part 2 — Pair Coding (~10 min)

Add input validation to the pipeline. Write a function:

```python
def validate_observations(df: pd.DataFrame) -> pd.DataFrame:
    ...
```

It should:

1. **Check required columns** — raise `ValueError` if any of these are missing: `site_id`, `tank_id`, `observed_at`, `confidence_score`, `image_id`

2. **Drop bad rows** — remove (and log) any rows where `image_id` is null or empty, since those observations can't be traced back to their source image.

3. **Return** the cleaned DataFrame.

Call it at the top of `run_pipeline()`, right after loading the data.
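The three requirements can be sketched as follows. This is one possible shape, not the expected answer; the logger setup and exact messages are illustrative choices:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

REQUIRED_COLUMNS = {"site_id", "tank_id", "observed_at", "confidence_score", "image_id"}

def validate_observations(df: pd.DataFrame) -> pd.DataFrame:
    """Validate raw observation rows before they enter the pipeline."""
    # 1. All required columns must be present.
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f"observations missing required columns: {sorted(missing)}")

    # 2. Drop (and log) rows whose image_id is null or empty — these can't be
    #    traced back to a source image.
    bad = df["image_id"].isna() | (df["image_id"].astype(str).str.strip() == "")
    if bad.any():
        logger.warning("Dropping %d observation(s) with missing image_id", int(bad.sum()))

    # 3. Return the cleaned frame.
    return df.loc[~bad].reset_index(drop=True)
```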

This is collaborative — your interviewer may suggest additional checks based on the review discussion. Feel free to add tests if time allows.


## Notes

- Use any reference material you normally would (docs, search, etc.)
- There's no single right answer — working, readable, defensive code is the goal
- If you get stuck, talking through your approach is just as valuable
