address_levels hierarchy inconsistent: 3-level countries have variable depth (LV 62%, SK 55% missing finest level) #509

@yharby

Summary

In the 2026-03-18.0 release, several countries listed with 3 address_levels have a variable number of populated levels. Consumers who assume the finest level (index [3], 1-based as in DuckDB) always contains the city/municipality lose significant coverage: 62% of addresses for Latvia and 55% for Slovakia.

This is similar to #367 (US addresses with NULL address_levels[2]), but affects the 3-level countries more severely.

Affected Countries

3-level countries with variable depth

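| Country       | Total addresses | level3 populated | level3 NULL (city must come from level2 or level1) | % lost if only checking level3 |
|---------------|-----------------|------------------|----------------------------------------------------|--------------------------------|
| LV (Latvia)   | 548,712         | 208,256          | 340,456                                            | 62.0%                          |
| SK (Slovakia) | 1,697,528       | 757,325          | 940,203                                            | 55.4%                          |
| EE (Estonia)  | 2,228,661       | 2,076,759        | 151,902                                            | 6.8%                           |
| IT (Italy)    | 25,914,431      | 25,912,438       | 1,993                                              | <0.01%                         |
| TW (Taiwan)   | 9,630,602       | 9,630,597        | 5                                                  | <0.01%                         |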
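
The last column is the share of addresses whose level3 is NULL. As a quick sanity check, here is a minimal Python sketch that recomputes it from the counts in the table. The COUNTS dictionary and the pct_lost name are illustrative only and not part of any Overture tooling.

```python
# Counts copied from the table above: (total addresses, level3 NULL).
COUNTS = {
    "LV": (548_712, 340_456),
    "SK": (1_697_528, 940_203),
    "EE": (2_228_661, 151_902),
    "IT": (25_914_431, 1_993),
    "TW": (9_630_602, 5),
}


def pct_lost(total: int, level3_null: int) -> float:
    """Percentage of addresses missed when only level3 is checked."""
    return 100.0 * level3_null / total


for country, (total, null_count) in COUNTS.items():
    print(f"{country}: {pct_lost(total, null_count):.2f}%")
```

For LV, SK, and EE the rounded results match the 62.0%, 55.4%, and 6.8% reported above; IT and TW both come out below 0.01%.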

Latvia — 3 distinct hierarchy patterns

Pattern 1 (111K): Major cities — only level1 populated
  level1=Rīga, level2=NULL, level3=NULL
  → City IS level1 (Rīga)

Pattern 2 (229K): Novads + town — level1 and level2 populated
  level1=Jēkabpils nov., level2=Jēkabpils, level3=NULL
  → City IS level2 (Jēkabpils)

Pattern 3 (208K): Novads + pagasts + village — all 3 levels
  level1=Olaines nov., level2=Olaines pag., level3=Jāņupe
  → City IS level3 (Jāņupe)

Slovakia — 2 patterns

Pattern 1 (940K): District — level3 NULL
  level1=Bratislavský, level2=Bratislava-Ružinov, level3=NULL
  → City IS level2 (Bratislava-Ružinov)

Pattern 2 (757K): Municipality — all 3 levels
  level1=Prešovský, level2=Spišská Nová Ves, level3=Spišská Nová Ves
  → City IS level3 (Spišská Nová Ves)

Estonia — 2 patterns

Pattern 1 (152K): Linn (town) — level3 NULL
  level1=Ida-Viru maakond, level2=Narva linn, level3=NULL
  → City IS level2 (Narva linn)

Pattern 2 (2.1M): Village/district — all 3 levels
  level1=Harju maakond, level2=Tallinna linn, level3=Kesklinn
  → City IS level3 (Kesklinn)
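
The patterns above differ only in which level is the deepest one populated. The sketch below classifies a few of the example rows that way, mirroring the CASE expression in the reproduction query. It assumes a simplified representation in which address_levels is a plain list of strings, with None standing in for NULL (the real column holds structs with a value field). The function name is hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def finest_populated_level(levels: Sequence[Optional[str]]) -> str:
    """Return 'levelN' for the deepest non-NULL entry, or 'none'."""
    for index in range(len(levels), 0, -1):  # walk finest -> coarsest
        if levels[index - 1] is not None:
            return f"level{index}"
    return "none"


# Example rows taken from the patterns above (None stands in for NULL).
rows = [
    ("LV", ["Rīga", None, None]),
    ("LV", ["Jēkabpils nov.", "Jēkabpils", None]),
    ("LV", ["Olaines nov.", "Olaines pag.", "Jāņupe"]),
    ("SK", ["Bratislavský", "Bratislava-Ružinov", None]),
    ("EE", ["Ida-Viru maakond", "Narva linn", None]),
]

# Tally rows per (country, finest populated level).
tally = Counter(
    (country, finest_populated_level(levels)) for country, levels in rows
)
for (country, level), count in sorted(tally.items()):
    print(country, level, count)
```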

Also: US (related to #367)

The 2-level US data still has 37.6M addresses (30%) with address_levels[2] = NULL. Of those, 85% (32.1M) have postal_city as a fallback, but 5.5M US addresses have no city information at all — no level2 AND no postal_city. Top states affected: TX (1.7M), MS (852K), CA (575K), FL (455K).

Query to Reproduce

-- Shows all country × depth combinations
SELECT country,
  len(address_levels) AS levels_count,
  CASE
    WHEN address_levels[3].value IS NOT NULL THEN 'level3'
    WHEN address_levels[2].value IS NOT NULL THEN 'level2'
    WHEN address_levels[1].value IS NOT NULL THEN 'level1'
    ELSE 'none'
  END AS finest_populated_level,
  count(*) AS cnt
FROM read_parquet(
  's3://overturemaps-us-west-2/release/2026-03-18.0/theme=addresses/type=address/*',
  hive_partitioning=0
)
GROUP BY country, levels_count, finest_populated_level
ORDER BY country, levels_count, finest_populated_level;

Suggestion

It would help consumers if the documentation clarified that:

  1. address_levels depth is variable within a country — the array length doesn't guarantee all values are populated
  2. The recommended city extraction pattern is a COALESCE cascade (finest → coarsest):
    COALESCE(address_levels[3].value, address_levels[2].value, address_levels[1].value)
  3. For US addresses without level2, postal_city is the intended fallback (it covers 85% of those addresses)
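
Points 2 and 3 can be combined into one extraction helper. The following is a rough sketch, not an official recommendation. It assumes address_levels has already been flattened to a list of strings (None for NULL) and that, for US rows, level1 is a state-like unit that should not be returned as a city, so the US branch falls back to postal_city instead of level1. The extract_city name and the US sample values are hypothetical.

```python
from typing import Optional, Sequence


def extract_city(
    country: str,
    levels: Sequence[Optional[str]],
    postal_city: Optional[str] = None,
) -> Optional[str]:
    """Pick a city-like name, or None if nothing usable is present."""
    if country == "US":
        # Assumption: US level2 is the city; fall back to postal_city.
        level2 = levels[1] if len(levels) > 1 else None
        return level2 if level2 is not None else postal_city
    # Other countries: COALESCE-style cascade, finest -> coarsest.
    for value in reversed(levels):
        if value is not None:
            return value
    return None


# Rows from the issue's examples, plus one made-up US row.
print(extract_city("LV", ["Rīga", None, None]))
print(extract_city("LV", ["Jēkabpils nov.", "Jēkabpils", None]))
print(extract_city("US", ["TX", None], postal_city="Austin"))
```

A US row with neither level2 nor postal_city yields None here, which corresponds to the 5.5M addresses described above as having no city information at all.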

This would prevent other consumers from hitting the same issue we did when building a geocoder on top of this data.

Environment

  • Release: 2026-03-18.0
  • Queried via DuckDB 1.5 + MotherDuck
  • 39 countries, 469M addresses total
