A fully automated Python scraper that extracts crested gecko images from online marketplaces (starting with MorphMarket.com), normalizes genetic trait names, filters for quality, deduplicates images, and organizes them for AutoML training.
- 🌐 Multi-Site Support - Modular architecture ready for multiple marketplace sources
- 🏷️ Automatic Trait Discovery - Extracts and normalizes 50+ genetic traits with fuzzy matching
- 🔍 Intelligent Deduplication - Perceptual hashing to skip duplicate images
- ✅ Image Quality Validation - Ensures images meet AutoML training standards (min 512×512)
- 📊 Dataset Preparation - Organizes images by traits for machine learning
- 🧪 Test Mode - Configurable test/production modes with different limits
- Trait Normalization - Handles spelling variations (Harlequin/Harley/Harli → Harlequin)
- Fuzzy Matching - Levenshtein distance matching for unknown traits
- Auto-Discovery - Automatically adds new traits after 3 occurrences
- Gentle Retry Logic - 5 retry attempts with exponential backoff (10s→300s)
- Progress Checkpoints - Resume scraping from last checkpoint
- Comprehensive Reporting - Detailed statistics and daily summary reports
- Multi-label support (geckos can have up to 3 traits)
- Stratified train/validation/test splits (70/15/15)
- Multiple export formats (Vertex AI, AWS Rekognition, COCO JSON)
- Quality filtering (resolution, aspect ratio, file size)
- Metadata tracking for all images
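The alias-plus-fuzzy-matching idea behind trait normalization can be sketched as follows. This is a minimal illustration with a made-up alias table and hypothetical function names, not the project's actual implementation:

```python
from typing import Optional

# Illustrative sketch: check an alias dictionary first, then fall back
# to Levenshtein-distance fuzzy matching against canonical trait names.
CANONICAL_TRAITS = {"Harlequin", "Pinstripe", "Lilly White", "Axanthic"}
ALIASES = {"harley": "Harlequin", "harli": "Harlequin", "pin": "Pinstripe",
           "lw": "Lilly White", "axan": "Axanthic"}

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def normalize_trait(raw: str, max_distance: int = 2) -> Optional[str]:
    token = raw.strip().lower()
    if token in ALIASES:
        return ALIASES[token]
    # Fuzzy fallback: closest canonical trait within the threshold
    best = min(CANONICAL_TRAITS, key=lambda t: levenshtein(token, t.lower()))
    if levenshtein(token, best.lower()) <= max_distance:
        return best
    return None  # unknown trait -> logged for auto-discovery
```

With a distance threshold of 2, misspellings like `pinstrip` still resolve to `Pinstripe`, while genuinely unknown words fall through to the unknown-traits log.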
- Python 3.8 or higher
- Google Chrome (for Selenium)
1. Clone or download this repository
2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. (Optional) Review and customize the configuration files:
   - `config/config.yaml` - Main configuration
   - `config/trait_normalization.yaml` - Trait dictionary
Edit `config/config.yaml`:

```yaml
# Set to false for production (unlimited pages)
TEST_MODE: true
```

Test Mode (default):
- Page limit: 5 pages
- Delay: 2-3 seconds
- Good for initial testing
Production Mode:
- Page limit: Unlimited
- Delay: 3-5 seconds
- Full scraping
- Max traits per gecko: Currently set to 3 (skip geckos with 4+ traits)
- Image quality: Min 512×512, max 4096×4096, aspect ratio 0.33-3.0
- Duplicate thresholds: Distance ≤5 (exact), ≤10 (near-duplicate)
- Retry logic: 5 attempts with exponential backoff
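The retry behavior described above can be sketched like this. `fetch_with_retry` is a hypothetical helper, not the scraper's actual API; it shows the general shape of 5 attempts with capped exponential backoff:

```python
import random
import time

# Sketch of the "gentle retry" policy: 5 attempts with exponential
# backoff capped at 300 s, plus a little jitter between attempts.
def fetch_with_retry(fetch, max_retries=5, base_delay=10, max_delay=300):
    for attempt in range(1, max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # exhausted all attempts; surface the error
            # 10s, 20s, 40s, ... capped at 300s
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            time.sleep(delay + random.uniform(0, 1))
```

Capping the backoff keeps a long outage from stalling a session indefinitely, while the jitter avoids synchronized retry bursts.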
Run the scraper:

```bash
python main.py
```

The pipeline runs through the following stages:

- Initialization - Loads config, initializes database and components
- Scraping - Visits MorphMarket pages, extracts listings
- Trait Extraction - Identifies genetic traits from titles/descriptions
- Normalization - Maps trait variations to canonical names
- Filtering - Skips geckos with 4+ traits or no images
- Image Download - Downloads images with retry logic
- Quality Check - Validates resolution, aspect ratio, format
- Deduplication - Skips duplicate images using perceptual hashing
- Organization - Saves images to trait-specific folders
- Reporting - Generates summary report with statistics
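The deduplication stage can be sketched with a toy average hash plus Hamming-distance thresholds. To stay self-contained, this example assumes the image has already been decoded and downscaled to an 8×8 grayscale grid (the real scraper would do that with an image library such as Pillow); all names here are illustrative:

```python
# Toy perceptual hash: one bit per pixel, set if brighter than the mean.
# `pixels` is assumed to be an 8x8 grid of grayscale values (0-255).
def average_hash(pixels):
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return sum((1 << i) for i, p in enumerate(flat) if p > mean)

def hamming(h1: int, h2: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

def classify(h1, h2, exact=5, near=10):
    """Apply the documented distance thresholds (<=5 exact, <=10 near)."""
    d = hamming(h1, h2)
    if d <= exact:
        return "exact duplicate"
    if d <= near:
        return "near duplicate"
    return "distinct"
```

Because the hash reflects overall image structure rather than exact bytes, re-encoded or slightly resized copies of the same photo land within a few bits of each other and get skipped.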
```
CrestedGeckoMorph/
├── data/
│   ├── images/
│   │   ├── raw/                  # Original downloads
│   │   │   └── MorphMarket/
│   │   │       └── [gecko_id]/
│   │   │           └── MorphMarket_[id]_1.jpg
│   │   └── by_trait/             # Organized by trait
│   │       ├── Harlequin/
│   │       ├── Pinstripe/
│   │       ├── Lilly White/
│   │       └── ...
│   ├── gecko_data.db             # SQLite database
│   ├── unknown_traits.json       # Unknown traits log
│   ├── new_traits.yaml           # Auto-discovered traits
│   ├── logs/
│   │   └── scraper.log
│   └── reports/
│       └── summary_YYYY-MM-DD_HH-MM-SS.json
```
The scraper includes a comprehensive trait dictionary with 50+ traits:
- Harlequin (harley, harli, harly)
- Pinstripe (pin, pinner)
- Tiger, Flame, Dalmatian
- Halloween, Tricolor, Brindle
- Lilly White (lilly-white, lillywhite, LW)
- Axanthic (axan, axa)
- Phantom, Lavender
- Frappuccino, Cappuccino, Mocha
- Fringe, Patternless
- Red Phantom, Extreme variants
Auto-Discovery: Unknown traits appearing 3+ times are automatically added.
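The auto-discovery rule can be sketched as a simple occurrence counter. Class and method names here are hypothetical, not the project's real API:

```python
from collections import Counter

# Unknown trait spellings are counted; anything seen `threshold` times
# is promoted (the real scraper would also write it to new_traits.yaml).
class TraitDiscovery:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.unknown = Counter()
        self.discovered = set()

    def record(self, trait: str) -> bool:
        """Count an unknown trait; return True the moment it is promoted."""
        key = trait.strip().lower()
        self.unknown[key] += 1
        if key not in self.discovered and self.unknown[key] >= self.threshold:
            self.discovered.add(key)
            return True
        return False
```

Counting case-insensitively means `Soot`, `soot`, and `SOOT` all accumulate toward the same threshold.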
Each run generates a detailed JSON report including:
- Scraping Stats: Pages scraped, listings processed/skipped
- Image Stats: Downloaded, duplicates skipped, quality rejections
- Trait Stats: Normalizations, fuzzy matches, auto-added traits
- Skip Breakdown: Reasons for skipping listings (4+ traits, no images, etc.)
- Error Log: All errors encountered during scraping
Example summary output:
```
===============================================================
                 SCRAPING SESSION SUMMARY
===============================================================
⏱️  Duration: 0:15:32
🧪 Test Mode: Yes

📊 SCRAPING RESULTS:
   Pages scraped: 5
   Listings found: 48
   Listings processed: 32
   Listings skipped: 16

🖼️  IMAGE PROCESSING:
   Images downloaded: 95
   Duplicates skipped: 12
   Quality rejections: 8

🏷️  TRAITS:
   Newly discovered traits: 3
```
SQLite database (data/gecko_data.db) contains:
Tables:
- `geckos` - Listing information, traits, metadata
- `images` - Image paths, hashes, dimensions
- `checkpoints` - Progress tracking for resume
Indexes on hash fields for fast duplicate detection.
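A plausible minimal schema matching these tables can be sketched with the standard-library `sqlite3` module. The actual column names in `gecko_data.db` may differ; this is an assumption-laden illustration:

```python
import sqlite3

# In-memory database for illustration; the scraper uses data/gecko_data.db
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE geckos (
    id     TEXT PRIMARY KEY,   -- marketplace listing id
    source TEXT NOT NULL,      -- e.g. 'MorphMarket'
    title  TEXT,
    traits TEXT                -- JSON-encoded list of canonical traits
);
CREATE TABLE images (
    path     TEXT PRIMARY KEY,
    gecko_id TEXT REFERENCES geckos(id),
    phash    TEXT,             -- perceptual hash (hex)
    width    INTEGER,
    height   INTEGER
);
CREATE TABLE checkpoints (
    source     TEXT PRIMARY KEY,
    last_page  INTEGER,        -- resume point
    updated_at TEXT
);
-- Index on the hash column for fast duplicate lookups
CREATE INDEX idx_images_phash ON images(phash);
""")
```

Indexing `phash` is what makes the "have we seen this image before?" check cheap even as the table grows.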
The modular architecture supports adding new scraping sources:
1. Create a new scraper in `src/scrapers/` extending `BaseScraper`
2. Implement the abstract methods: `get_search_url()`, `parse_listing_page()`, etc.
3. Add configuration to `config/config.yaml` under `sources:`
4. Update `main.py` to initialize the new scraper
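A new source might look like the sketch below. The `BaseScraper` shape and method signatures are assumptions based on the method names mentioned above, and the URL is illustrative only:

```python
from abc import ABC, abstractmethod

# Hypothetical shape of the BaseScraper contract; the real abstract
# methods in src/scrapers/ may have different signatures.
class BaseScraper(ABC):
    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def get_search_url(self, page: int) -> str: ...

    @abstractmethod
    def parse_listing_page(self, html: str) -> dict: ...

class FaunaClassifiedsScraper(BaseScraper):
    """Example new source (URL pattern is made up for illustration)."""

    def get_search_url(self, page: int) -> str:
        return f"https://example.com/crested-geckos?page={page}"

    def parse_listing_page(self, html: str) -> dict:
        # Real code would parse the HTML (e.g. with BeautifulSoup)
        return {"title": "", "traits": [], "image_urls": []}
```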
Example scrapers that can be added:
- Fauna Classifieds
- Gecko Time Classifieds
- MorphMarket US/UK/EU variants
The organized dataset is optimized for training image classification models:
- Multi-label classification (geckos can have multiple traits)
- Quality-filtered images (512×512 minimum, proper aspect ratios)
- Deduplicated to prevent training on duplicates
- Stratified splits for balanced training/validation/test sets
- Multiple export formats (Vertex AI CSV, AWS manifest, COCO JSON)
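As one export example, a Vertex AI-style import CSV for multi-label image classification lists a split column, a Cloud Storage URI, then one column per label. The bucket path and rows below are made up for illustration:

```python
import csv
import io

# Hypothetical export rows: (split, GCS URI, trait labels)
rows = [
    ("TRAINING",   "gs://my-bucket/geckos/MM_001_1.jpg", ["Harlequin", "Pinstripe"]),
    ("VALIDATION", "gs://my-bucket/geckos/MM_002_1.jpg", ["Lilly White"]),
    ("TEST",       "gs://my-bucket/geckos/MM_003_1.jpg", ["Axanthic", "Tiger"]),
]

def to_vertex_csv(rows) -> str:
    """Render rows as CSV: split, uri, label1[, label2, ...]."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for split, uri, labels in rows:
        writer.writerow([split, uri, *labels])
    return buf.getvalue()
```

Variable-length label columns are what make this format multi-label: a gecko with two traits simply gets two label cells.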
- Google Vertex AI AutoML Vision
- Azure Custom Vision
- AWS Rekognition Custom Labels
- 100 images per trait (minimum for training)
- 500+ images per trait (recommended for good accuracy)
- 2000+ total images for robust multi-label classification
If Selenium can't find Chrome:

```bash
# Install/update webdriver
pip install --upgrade webdriver-manager
```

If the database is locked:

```bash
# Close any other processes using the database
# Or delete data/gecko_data.db to start fresh
```

Make sure you're running from the project root:

```bash
cd CrestedGeckoMorph
python main.py
```

If MorphMarket changes their HTML structure:
- Update selectors in `src/scrapers/morphmarket_scraper.py`
- Check the `extract_listing_urls()` and `parse_listing_page()` methods
Scraper:
- `page_limit`: Max pages to scrape (null = unlimited)
- `delay_min` / `delay_max`: Random delay between requests (seconds)
- `max_retries`: Number of retry attempts (5)
- `timeout`: Request timeout (45s)
Traits:
- `max_traits_per_gecko`: Skip geckos with more traits than this (3)
- `fuzzy_match_threshold`: Max Levenshtein distance (2)
- `auto_add_threshold`: Auto-add after N occurrences (3)
Images:
- `min_width` / `min_height`: Minimum resolution (512×512)
- `max_aspect_ratio`: Max aspect ratio (3.0 = 3:1)
- `max_file_size_mb`: Maximum file size (10MB)
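The image thresholds combine into a simple pass/fail check, sketched below with the documented defaults. The helper name is hypothetical:

```python
# Quality filter sketch: resolution bounds, aspect ratio, and file size.
# Defaults mirror the values documented in this README.
def passes_quality(width, height, file_size_mb,
                   min_side=512, max_side=4096,
                   min_ratio=0.33, max_ratio=3.0, max_mb=10):
    if not (min_side <= width <= max_side and min_side <= height <= max_side):
        return False  # too small or too large for AutoML training
    ratio = width / height
    if not (min_ratio <= ratio <= max_ratio):
        return False  # extreme panoramas/strips rarely show the whole gecko
    return file_size_mb <= max_mb
```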
Duplicates:
- `exact_duplicate`: Exact-duplicate hash distance threshold (5)
- `near_duplicate`: Near-duplicate hash distance threshold (10)
This tool is for educational and research purposes. Respect website terms of service and use responsible scraping practices with appropriate delays.
Feel free to:
- Add new marketplace scrapers
- Improve trait extraction algorithms
- Enhance image quality validation
- Add data augmentation features
Version 1.0 - January 2026
Note: Always use responsible scraping practices. The default configuration includes generous delays (2-5 seconds) to avoid overloading servers.