Skip to content

PostgreSqlDialect accepts large amounts of non-PostgreSQL syntax #2237

@LucaCappelletti94

Description

@LucaCappelletti94

While building a parser correctness benchmark using libpg_query (pg_query.rs) as the PostgreSQL ground truth, we measured how often PostgreSqlDialect accepts SQL that real PostgreSQL rejects. The numbers are surprisingly high.

Against SQL extracted from the sqlparser-rs test suite itself:

  • 28.7% of statements rejected by pg_query are silently accepted by PostgreSqlDialect (37/129, PostgreSQL-specific test file)
  • 30.0% in the broader common-dialect test file (141/470)

We understand sqlparser-rs is intentionally permissive. The question is: is this level of permissiveness intentional for PostgreSqlDialect, or is it leakage that would be worth tightening?

Examples of what PostgreSqlDialect currently accepts

A selection from the 141 cases found, grouped by the dialect the syntax originates from:

-- Oracle
FETCH NEXT IN my_cursor INTO result_table   -- INTO clause on FETCH

-- SQL Server / T-SQL
SELECT TOP 3 * FROM tbl
EXEC my_proc N'param'
MERGE … OUTPUT inserted.* INTO log_target
EXECUTE FUNCTION f                          -- trigger EXECUTE without ()

-- MySQL / MariaDB
INSERT customer VALUES (1, 2, 3)            -- missing INTO
INSERT OR REPLACE INTO t (id) VALUES(1)
DROP FUNCTION IF EXISTS f(a INTEGER, IN b INTEGER = 1)  -- defaults in DROP

-- Snowflake / BigQuery
SELECT i FROM qt QUALIFY ROW_NUMBER() OVER (...) = 1
CREATE OR REPLACE TABLE t (a INT)
CREATE OR REPLACE USER IF NOT EXISTS u1 PASSWORD='secret'

-- ClickHouse
ALTER TABLE t ON CLUSTER my_cluster ADD CONSTRAINT bar PRIMARY KEY (baz)

-- HiveQL
ALTER TABLE t SET TBLPROPERTIES('classification' = 'parquet')

-- Unclear origin / possibly over-permissive parsing
ALTER TABLE t ALTER COLUMN id ADD GENERATED AS IDENTITY  -- missing ALWAYS/BY DEFAULT
COPY t FROM 'f.csv' BINARY DELIMITER ',' CSV HEADER      -- mutually exclusive formats
SHOW search_path search_path                             -- duplicate trailing token

Happy to help with PRs if the direction is clear.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions