grasberg/CrestedGeckoMorph


Crested Gecko Morph Scraper

A fully automated Python scraper that extracts crested gecko images from online marketplaces (starting with MorphMarket.com), normalizes genetic trait names, filters for quality, deduplicates images, and organizes them for AutoML training.

Features

Core Functionality

  • 🌐 Multi-Site Support - Modular architecture ready for multiple marketplace sources
  • 🏷️ Automatic Trait Discovery - Extracts and normalizes 50+ genetic traits with fuzzy matching
  • 🔍 Intelligent Deduplication - Perceptual hashing to skip duplicate images
  • 🖼️ Image Quality Validation - Ensures images meet AutoML training standards (min 512×512)
  • 📊 Dataset Preparation - Organizes images by traits for machine learning
  • 🧪 Test Mode - Configurable test/production modes with different limits

Advanced Features

  • Trait Normalization - Handles spelling variations (Harlequin/Harley/Harli → Harlequin)
  • Fuzzy Matching - Levenshtein distance matching for unknown traits
  • Auto-Discovery - Automatically adds new traits after 3 occurrences
  • Gentle Retry Logic - 5 retry attempts with exponential backoff (10s→300s)
  • Progress Checkpoints - Resume scraping from last checkpoint
  • Comprehensive Reporting - Detailed statistics and daily summary reports
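
The retry behavior described above (5 attempts, exponential backoff from 10s up to a 300s cap) can be sketched in a few lines. The helper name and the jitter term are illustrative, not the project's actual implementation:

```python
import random
import time

def retry_with_backoff(func, max_attempts=5, base_delay=10, max_delay=300):
    """Call func(), retrying on failure with exponential backoff.

    Delays grow 10s -> 20s -> 40s -> 80s -> ... capped at 300s, matching
    the 10s->300s range above. A small jitter keeps concurrent workers
    from retrying in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            delay = min(base_delay * (2 ** attempt), max_delay)
            time.sleep(delay + random.uniform(0, 1))
```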

AutoML Optimization

  • Multi-label support (geckos can have up to 3 traits)
  • Stratified train/validation/test splits (70/15/15)
  • Multiple export formats (Vertex AI, AWS Rekognition, COCO JSON)
  • Quality filtering (resolution, aspect ratio, file size)
  • Metadata tracking for all images
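
The 70/15/15 split above can be sketched as a deterministic shuffle-and-slice. Note this is a simplified random split, not true stratification — balancing every label combination in a multi-label dataset needs more machinery (e.g. iterative stratification):

```python
import random

def split_dataset(items, ratios=(0.70, 0.15, 0.15), seed=42):
    """Deterministically shuffle items and split into train/val/test.

    Simplified sketch: a plain seeded random split in the 70/15/15
    proportions, without per-label balancing.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```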

Installation

Prerequisites

  • Python 3.8 or higher
  • Google Chrome (for Selenium)

Setup

  1. Clone or download this repository

  2. Install dependencies:

pip install -r requirements.txt

  3. (Optional) Review and customize configuration files:
    • config/config.yaml - Main configuration
    • config/trait_normalization.yaml - Trait dictionary

Configuration

Test vs Production Mode

Edit config/config.yaml:

# Set to false for production (unlimited pages)
TEST_MODE: true

Test Mode (default):

  • Page limit: 5 pages
  • Delay: 2-3 seconds
  • Good for initial testing

Production Mode:

  • Page limit: Unlimited
  • Delay: 3-5 seconds
  • Full scraping

Other Settings

  • Max traits per gecko: Currently set to 3 (skip geckos with 4+ traits)
  • Image quality: Min 512×512, max 4096×4096, aspect ratio 0.33-3.0
  • Duplicate thresholds: Distance ≤5 (exact), ≤10 (near-duplicate)
  • Retry logic: 5 attempts with exponential backoff

Usage

Basic Usage

Run the scraper:

python main.py

What Happens

  1. Initialization - Loads config, initializes database and components
  2. Scraping - Visits MorphMarket pages, extracts listings
  3. Trait Extraction - Identifies genetic traits from titles/descriptions
  4. Normalization - Maps trait variations to canonical names
  5. Filtering - Skips geckos with 4+ traits or no images
  6. Image Download - Downloads images with retry logic
  7. Quality Check - Validates resolution, aspect ratio, format
  8. Deduplication - Skips duplicate images using perceptual hashing
  9. Organization - Saves images to trait-specific folders
  10. Reporting - Generates summary report with statistics
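
The deduplication step (8) boils down to comparing perceptual hashes by Hamming distance against the configured thresholds (≤5 exact, ≤10 near-duplicate). A minimal sketch, with hypothetical function names:

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Hamming distance between two equal-length hex hash strings."""
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def classify_pair(hash_a: str, hash_b: str, exact=5, near=10) -> str:
    """Label a pair of image hashes using the duplicate thresholds."""
    d = hamming_distance(hash_a, hash_b)
    if d <= exact:
        return "exact_duplicate"
    if d <= near:
        return "near_duplicate"
    return "unique"
```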

Output Structure

CrestedGeckoMorph/
├── data/
│   ├── images/
│   │   ├── raw/                     # Original downloads
│   │   │   └── MorphMarket/
│   │   │       └── [gecko_id]/
│   │   │           └── MorphMarket_[id]_1.jpg
│   │   └── by_trait/                # Organized by trait
│   │       ├── Harlequin/
│   │       ├── Pinstripe/
│   │       ├── Lilly White/
│   │       └── ...
│   ├── gecko_data.db                # SQLite database
│   ├── unknown_traits.json          # Unknown traits log
│   ├── new_traits.yaml              # Auto-discovered traits
│   ├── logs/
│   │   └── scraper.log
│   └── reports/
│       └── summary_YYYY-MM-DD_HH-MM-SS.json

Trait Normalization

The scraper includes a comprehensive trait dictionary with 50+ traits:

Pattern Traits

  • Harlequin (harley, harli, harly)
  • Pinstripe (pin, pinner)
  • Tiger, Flame, Dalmatian
  • Halloween, Tricolor, Brindle

Color Morphs

  • Lilly White (lilly-white, lillywhite, LW)
  • Axanthic (axan, axa)
  • Phantom, Lavender

Special Traits

  • Frappuccino, Cappuccino, Mocha
  • Fringe, Patternless
  • Red Phantom, Extreme variants

Auto-Discovery: Unknown traits appearing 3+ times are automatically added.
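
Normalization first checks an alias table, then falls back to fuzzy matching. The project uses Levenshtein distance; in this sketch, stdlib difflib similarity stands in for it, and the alias table is heavily abbreviated:

```python
import difflib
from typing import Optional

# Abbreviated alias table; the real dictionary lives in
# config/trait_normalization.yaml and covers 50+ traits.
TRAIT_ALIASES = {
    "harley": "Harlequin", "harli": "Harlequin", "harly": "Harlequin",
    "pin": "Pinstripe", "pinner": "Pinstripe",
    "lilly-white": "Lilly White", "lillywhite": "Lilly White", "lw": "Lilly White",
    "axan": "Axanthic", "axa": "Axanthic",
}
CANONICAL = ["Harlequin", "Pinstripe", "Lilly White", "Axanthic", "Tiger", "Flame"]

def normalize_trait(raw: str) -> Optional[str]:
    """Map a raw trait string to its canonical name, or None if unknown."""
    key = raw.strip().lower()
    if key in TRAIT_ALIASES:
        return TRAIT_ALIASES[key]
    lowered = {c.lower(): c for c in CANONICAL}
    if key in lowered:
        return lowered[key]
    # Fuzzy fallback: difflib similarity stands in for the project's
    # Levenshtein-distance matching.
    match = difflib.get_close_matches(key, lowered, n=1, cutoff=0.8)
    return lowered[match[0]] if match else None
```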

Statistics & Reporting

Each run generates a detailed JSON report including:

  • Scraping Stats: Pages scraped, listings processed/skipped
  • Image Stats: Downloaded, duplicates skipped, quality rejections
  • Trait Stats: Normalizations, fuzzy matches, auto-added traits
  • Skip Breakdown: Reasons for skipping listings (4+ traits, no images, etc.)
  • Error Log: All errors encountered during scraping

Example summary output:

===============================================================
 SCRAPING SESSION SUMMARY
===============================================================

⏱️  Duration: 0:15:32
🧪 Test Mode: Yes

📊 SCRAPING RESULTS:
   Pages scraped: 5
   Listings found: 48
   Listings processed: 32
   Listings skipped: 16

🖼️  IMAGE PROCESSING:
   Images downloaded: 95
   Duplicates skipped: 12
   Quality rejections: 8

🏷️  TRAITS:
   Newly discovered traits: 3

Database Schema

SQLite database (data/gecko_data.db) contains:

Tables:

  • geckos - Listing information, traits, metadata
  • images - Image paths, hashes, dimensions
  • checkpoints - Progress tracking for resume

Indexes on hash fields for fast duplicate detection.
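
The three tables and the hash index could be created along these lines; the column names here are assumptions for illustration, not the project's actual schema:

```python
import sqlite3

# Illustrative schema only -- the actual columns in data/gecko_data.db
# may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS geckos (
    id INTEGER PRIMARY KEY,
    listing_id TEXT UNIQUE,
    source TEXT,
    traits TEXT              -- comma-separated canonical trait names
);
CREATE TABLE IF NOT EXISTS images (
    id INTEGER PRIMARY KEY,
    gecko_id INTEGER REFERENCES geckos(id),
    path TEXT,
    phash TEXT,              -- perceptual hash (hex string)
    width INTEGER,
    height INTEGER
);
CREATE TABLE IF NOT EXISTS checkpoints (
    id INTEGER PRIMARY KEY,
    source TEXT,
    last_page INTEGER,
    updated_at TEXT
);
-- index on the hash field for fast duplicate lookups
CREATE INDEX IF NOT EXISTS idx_images_phash ON images(phash);
"""

def init_db(path: str = "data/gecko_data.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```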

Extending to Other Sites

The modular architecture supports adding new scraping sources:

  1. Create new scraper in src/scrapers/ extending BaseScraper
  2. Implement abstract methods: get_search_url(), parse_listing_page(), etc.
  3. Add configuration to config/config.yaml under sources:
  4. Update main.py to initialize new scraper

Example scrapers that can be added:

  • Fauna Classifieds
  • Gecko Time Classifieds
  • MorphMarket US/UK/EU variants
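
A new source might look like the sketch below. The base-class skeleton is an assumption reconstructed from the abstract methods named above, and the URL and selectors are placeholders:

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Simplified stand-in for the real base class in src/scrapers/."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def get_search_url(self, page: int) -> str:
        """Return the search-results URL for a given page number."""

    @abstractmethod
    def parse_listing_page(self, html: str) -> dict:
        """Extract title, traits, and image URLs from a listing page."""

class FaunaClassifiedsScraper(BaseScraper):
    def get_search_url(self, page: int) -> str:
        return f"https://example.com/geckos?page={page}"  # placeholder URL

    def parse_listing_page(self, html: str) -> dict:
        # A real implementation would apply site-specific selectors
        # (e.g. with BeautifulSoup) instead of returning empty fields.
        return {"title": "", "traits": [], "image_urls": []}
```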

Future Use: AutoML Training

The organized dataset is optimized for training image classification models:

Dataset Features

  • Multi-label classification (geckos can have multiple traits)
  • Quality-filtered images (512×512 minimum, proper aspect ratios)
  • Deduplicated to prevent training on duplicates
  • Stratified splits for balanced training/validation/test sets
  • Multiple export formats (Vertex AI CSV, AWS manifest, COCO JSON)

Recommended AutoML Platforms

  • Google Vertex AI AutoML Vision
  • Azure Custom Vision
  • AWS Rekognition Custom Labels

Minimum Dataset Size

  • 100 images per trait (minimum for training)
  • 500+ images per trait (recommended for good accuracy)
  • 2000+ total images for robust multi-label classification

Troubleshooting

Chrome Driver Issues

If Selenium can't find Chrome:

# Install/update webdriver
pip install --upgrade webdriver-manager

Database Locked

If database is locked:

# Close any other processes using the database
# Or delete data/gecko_data.db to start fresh

Import Errors

Make sure you're running from the project root:

cd CrestedGeckoMorph
python main.py

Site Changes

If MorphMarket changes their HTML structure:

  • Update selectors in src/scrapers/morphmarket_scraper.py
  • Check extract_listing_urls() and parse_listing_page() methods

Configuration Reference

Key Settings

Scraper:

  • page_limit: Max pages to scrape (null = unlimited)
  • delay_min/max: Random delay between requests (seconds)
  • max_retries: Number of retry attempts (5)
  • timeout: Request timeout (45s)

Traits:

  • max_traits_per_gecko: Skip if more than this (3)
  • fuzzy_match_threshold: Levenshtein distance (2)
  • auto_add_threshold: Auto-add after N occurrences (3)

Images:

  • min_width/height: Minimum resolution (512×512)
  • max_aspect_ratio: Max aspect ratio (3.0 = 3:1)
  • max_file_size_mb: Maximum file size (10MB)
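
Taken together, the image limits amount to a simple predicate; the function and parameter names here are illustrative:

```python
def passes_quality_check(width: int, height: int, file_size_mb: float,
                         min_side=512, max_side=4096,
                         min_aspect=0.33, max_aspect=3.0,
                         max_mb=10) -> bool:
    """Apply the resolution, aspect-ratio, and file-size limits above."""
    if width < min_side or height < min_side:
        return False
    if width > max_side or height > max_side:
        return False
    aspect = width / height
    if not (min_aspect <= aspect <= max_aspect):
        return False
    return file_size_mb <= max_mb
```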

Duplicates:

  • exact_duplicate: Hash distance threshold (5)
  • near_duplicate: Near-duplicate threshold (10)

License

This tool is for educational and research purposes. Respect website terms of service and use responsible scraping practices with appropriate delays.

Contributing

Feel free to:

  • Add new marketplace scrapers
  • Improve trait extraction algorithms
  • Enhance image quality validation
  • Add data augmentation features

Version

Version 1.0 - January 2026


Note: Always use responsible scraping practices. The default configuration includes generous delays (2-5 seconds) to avoid overloading servers.
