Skip to content

[SPARK-55981][SQL] Allow Geo Types with SRID's from the pre-built registry#54780

Closed
szehon-ho wants to merge 10 commits intoapache:masterfrom
szehon-ho:srid
Closed

[SPARK-55981][SQL] Allow Geo Types with SRID's from the pre-built registry#54780
szehon-ho wants to merge 10 commits intoapache:masterfrom
szehon-ho:srid

Conversation

@szehon-ho
Copy link
Copy Markdown
Member

@szehon-ho szehon-ho commented Mar 13, 2026

What changes were proposed in this pull request?

This is based on PR: #54571, which pre-compiles a list of SRID's from Proj.

Once that is in, we can activate these SRID's in the Geometry and Geography types

We also have some overrides to support OGC standard as well.

Note: some of these changes were from @uros-db initially in #54543, thanks!

Why are the changes needed?

Support standard SRID's for new Geo types

Does this PR introduce any user-facing change?

Geo types not released yet

How was this patch tested?

Add unit tests

Was this patch authored or co-authored using generative AI tooling?

Yes

@szehon-ho szehon-ho force-pushed the srid branch 2 times, most recently from 279e515 to 0022bca Compare March 16, 2026 22:50
* its SRID.
*/
@Test
public void testMapValidity() {
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

chat with @uros-db offline, its not worth it to check in a list of all expected srid/crs (which somewhat duplicates the already checked-in registry), so adding just some sanity checks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @szehon-ho!

@szehon-ho szehon-ho force-pushed the srid branch 5 times, most recently from e8db023 to 1f5ebd3 Compare March 21, 2026 01:25
Copy link
Copy Markdown
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just marking this as not ready for merge yet, due to the changes in GeographicSpatialReferenceSystemCache.java. I'm not 100% sure whether this is the best path forward, let's step back and reassess before committing.

@szehon-ho szehon-ho changed the title [SPARK-55981][SQL] Allow Geo Types with SRID's fro the pre-built registry [SPARK-55981][SQL] Allow Geo Types with SRID's from the pre-built registry Mar 26, 2026
Comment thread sql/core/src/test/scala/org/apache/spark/sql/types/GeographyTypeSuite.scala Outdated
- Fix GeographyType.apply(srid) to use actual CRS instead of hardcoded default
- Restore removed null assertions and comments in SpatialReferenceSystemMapperSuite
- Add ESRI:102013 and EPSG:4612 tests for PROJ-sourced registry entries
- Add more valid SRIDs (4267, 4269, 4612) across Geography test suites
- Add more invalid SRIDs (999, 99999) to GeometryTypeSuite
- Update JavaGeographyTypeSuite with more SRIDs and CRS values
- Add Geography round-trip test in GeographyDataFrameSuite
Assertions.assertEquals("OGC:CRS27", GeographicSpatialReferenceSystemMapper.getStringId(4267));
Assertions.assertEquals("OGC:CRS83", GeographicSpatialReferenceSystemMapper.getStringId(4269));
// GEOGRAPHY: PROJ-sourced registry entries (no override).
Assertions.assertEquals("EPSG:4612", GeographicSpatialReferenceSystemMapper.getStringId(4612));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's also add one geographic ESRI entry, same as above

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added ESRI:37001 as a geographic ESRI entry for GeographicSpatialReferenceSystemMapper.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a few new supported ones here, incl. ESPG and ESRI

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added "geometry(EPSG:3857)", "geometry(EPSG:32601)", and "geometry(ESRI:102013)" to the valid CRS JSON parsing test.

Comment on lines +149 to +171
test("createDataFrame and round-trip with Geometry SRIDs with overrides") {
// GeometryType supports 0, 3857, 4326, 4267, 4269 (OGC overrides).
val geom0 = Geometry.fromWKB(point1, 0)
val geom3857 = Geometry.fromWKB(point2, 3857)
val geom4326 = Geometry.fromWKB(point1, 4326)
val geom4269 = Geometry.fromWKB(point2, 4269)
val schema0 = StructType(Seq(StructField("g", GeometryType(0), nullable = false)))
val schema3857 = StructType(Seq(StructField("g", GeometryType(3857), nullable = false)))
val schema4326 = StructType(Seq(StructField("g", GeometryType(4326), nullable = false)))
val schema4269 = StructType(Seq(StructField("g", GeometryType(4269), nullable = false)))
checkAnswer(
spark.createDataFrame(sparkContext.parallelize(Seq(Row(geom0))), schema0),
Seq(Row(geom0)))
checkAnswer(
spark.createDataFrame(sparkContext.parallelize(Seq(Row(geom3857))), schema3857),
Seq(Row(geom3857)))
checkAnswer(
spark.createDataFrame(sparkContext.parallelize(Seq(Row(geom4326))), schema4326),
Seq(Row(geom4326)))
checkAnswer(
spark.createDataFrame(sparkContext.parallelize(Seq(Row(geom4269))), schema4269),
Seq(Row(geom4269)))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks very similar to the one above, let's just merge them into a single test.

Also, there's a lot of code duplication - seems like we can just parameterize using forach.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Merged both tests into a single "createDataFrame and round-trip with Geometry SRIDs" test parameterized with foreach over Seq(0, 3857, 2000, 4326, 4267, 4269, 102100).

Comment on lines +133 to +142
val geog4267 = Geography.fromWKB(point1, 4267)
val geog4269 = Geography.fromWKB(point2, 4269)
val schema4267 = StructType(Seq(StructField("g", GeographyType(4267), nullable = false)))
val schema4269 = StructType(Seq(StructField("g", GeographyType(4269), nullable = false)))
checkAnswer(
spark.createDataFrame(sparkContext.parallelize(Seq(Row(geog4267))), schema4267),
Seq(Row(geog4267)))
checkAnswer(
spark.createDataFrame(sparkContext.parallelize(Seq(Row(geog4269))), schema4269),
Seq(Row(geog4269)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a lot of code duplication - seems like we can just parameterize using forach.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Refactored to use foreach over Seq(4267, 4269), eliminating the per-SRID variable and schema duplication.

Comment on lines -20 to -21
import scala.collection.immutable.Seq

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's avoid unnecessary changes in this PR

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Restored the import scala.collection.immutable.Seq.

Copy link
Copy Markdown
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these changes @szehon-ho, and for addressing the comments!

I left just one more comment, otherwise LGTM. Also, the CI failure looks unrelated.

@uros-db
Copy link
Copy Markdown
Contributor

uros-db commented Mar 30, 2026

@cloud-fan Please take a look.

@szehon-ho
Copy link
Copy Markdown
Member Author

Test failure is not related (RocksDB)

@uros-db
Copy link
Copy Markdown
Contributor

uros-db commented Mar 31, 2026

Test failure is not related (RocksDB)

Yup, this should be good to go. @cloud-fan

Copy link
Copy Markdown
Contributor

@cloud-fan cloud-fan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Replaces the hardcoded 3-entry SRS registry with a CSV-based registry generated from PROJ (~10,642 entries), adds OGC standardization aliases (CRS84/CRS27/CRS83) and the Spark-specific SRID:0 on top, and fixes a latent bug in GeographyType.apply(srid) where the looked-up CRS was ignored in favor of the default.

Clean PR overall — two minor issues noted below.

Comment thread sql/core/src/test/scala/org/apache/spark/sql/GeometryDataFrameSuite.scala Outdated
@cloud-fan
Copy link
Copy Markdown
Contributor

thanks, merging to master!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants