
Filter a large number of prefixes with an input file #117

@digizeph

Description


At times, we need to filter on a large number of prefixes. Passing every prefix as a CLI filter parameter is insufficient at that scale; we should allow users to provide an input file containing the filtering criteria.

This may require updating bgpkit-parser so that it performs well with a large number of filters. We could design a general filter-file format (e.g. JSON) and accept that file as input.

Concrete Use Cases

1. Prefix-list filtering from RIB extraction

The most common pattern that hits this limit is BGP outage/visibility investigation:

  1. Extract the prefix list for a target ASN from a pre-event RIB dump
  2. Use that prefix list to filter subsequent BGP update files

This is necessary because BGP withdrawal messages carry no AS path or origin ASN — you can only filter withdrawals by prefix. The current workaround is:

# Step 1: Extract prefix list from RIB
monocle parse -o <ASN> /tmp/rib_pre.gz 2>/dev/null | \
  cut -d'|' -f5 | sort -u > /tmp/prefixes.txt

# Step 2: Build comma-separated string
PFXS=$(cat /tmp/prefixes.txt | tr '\n' ',' | sed 's/,$//')

# Step 3: Pass as -p argument
monocle parse -p "$PFXS" /tmp/updates.gz

This breaks down at scale:

ASN size        Approx. prefixes   -p arg size   Works?
Small ISP       ~200               ~3.6 KB       Yes
Medium ISP      ~1,000             ~18 KB        Yes
Large carrier   ~5,000             ~90 KB        Marginal
Tier-1 / hyper  ~10,000+           ~180 KB+      Exceeds ARG_MAX on many systems

ARG_MAX on Linux is typically ~2 MB (it scales with the stack limit), but the limit for a single argument string (MAX_ARG_STRLEN) is only 128 KiB on most systems. Even below these limits, very long arguments are slow to expand and pass through the shell.
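As a rough illustration of where the limit bites, one can compare the kernel limit against the size of a synthetic 10,000-prefix argument (the prefix values below are made up purely for sizing purposes):

```shell
# Total bytes the kernel allows for argv + environment combined
getconf ARG_MAX

# Synthesize ~10,000 comma-separated /24 prefixes and measure the argument size:
# the result is roughly 140 KB, already over the 128 KiB per-argument limit
seq 1 10000 | awk '{printf "10.%d.%d.0/24,", int($1/256), $1%256}' | wc -c
```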

2. Country-level investigation

When investigating a country-level event, you may need to filter by all prefixes originated by ASNs in that country — potentially tens of thousands of prefixes. This is impractical with -p on the command line.

3. Repeated parsing with the same filter set

In a typical investigation, the same prefix list is applied to 10-40+ update files sequentially. Each invocation re-parses the comma-separated -p argument from scratch. A file-based input that's parsed once and reused across invocations would be more efficient.

Proposed Design

Filter file format (JSON)

{
  "prefixes": ["192.0.2.0/24", "198.51.100.0/24", "2001:db8::/32"],
  "origin_asns": [64496, 64497],
  "peer_asns": [174, 6939],
  "as_path_regex": "174 64496$",
  "elem_type": "w",
  "communities": ["64496:100", "64496:200"]
}

All fields are optional; when multiple fields are present, they combine with AND logic (the same semantics as the existing CLI filters).
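For illustration, a newline-delimited prefix list (as produced by the workaround above) could be wrapped into this proposed format with a few lines of Python; the file paths are only examples:

```shell
# Example input: one prefix per line (as produced by the RIB-extraction step)
printf '192.0.2.0/24\n198.51.100.0/24\n2001:db8::/32\n' > /tmp/prefixes.txt

# Wrap the list in the proposed JSON filter-file format
python3 -c '
import json
prefixes = [l.strip() for l in open("/tmp/prefixes.txt") if l.strip()]
with open("/tmp/filters.json", "w") as f:
    json.dump({"prefixes": prefixes}, f, indent=2)
'
```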

CLI integration

# Use filter file instead of CLI flags
monocle parse --filter-file /tmp/filters.json /tmp/updates.gz
monocle search --filter-file /tmp/filters.json -t 2025-09-01T12:00:00Z -d 2h

# Filter file can be combined with CLI flags (AND logic)
monocle parse --filter-file /tmp/filters.json -c rrc00 /tmp/updates.gz

Alternative: plain text prefix list

For the most common case (prefix-only filtering), also support a simple newline-delimited file:

# One prefix per line
monocle parse --prefix-file /tmp/prefixes.txt /tmp/updates.gz

This is the most ergonomic option for the RIB-extract-then-filter-updates workflow, since monocle parse -o <ASN> rib.gz | cut -d'|' -f5 | sort -u already produces a newline-delimited prefix list.
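Whichever flag name is ultimately chosen (--prefix-file here is still hypothetical), it would be worth validating the file before starting a long parse run. A quick sanity check using only Python's standard library:

```shell
# Reject the file early if any non-empty line is not a valid IPv4/IPv6 prefix
python3 -c '
import ipaddress, sys
for n, line in enumerate(open("/tmp/prefixes.txt"), 1):
    line = line.strip()
    if not line:
        continue
    try:
        ipaddress.ip_network(line)
    except ValueError:
        sys.exit(f"line {n}: invalid prefix {line!r}")
print("prefix file OK")
'
```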

Interaction with #82

If #82 adds RIB snapshot queries with --sqlite-path output, the extracted prefix list could be queried from SQLite and fed as a filter file to subsequent update analysis — replacing the current fragile shell pipeline.
