[SPARK-55981][SQL] Allow Geo Types with SRID's from the pre-built registry#54780
[SPARK-55981][SQL] Allow Geo Types with SRID's from the pre-built registry#54780szehon-ho wants to merge 10 commits intoapache:masterfrom
Conversation
279e515 to
0022bca
Compare
| * its SRID. | ||
| */ | ||
| @Test | ||
| public void testMapValidity() { |
There was a problem hiding this comment.
chat with @uros-db offline, its not worth it to check in a list of all expected srid/crs (which somewhat duplicates the already checked-in registry), so adding just some sanity checks
e8db023 to
1f5ebd3
Compare
uros-db
left a comment
There was a problem hiding this comment.
Just marking this as not ready for merge yet, due to the changes in GeographicSpatialReferenceSystemCache.java. I'm not 100% sure whether this is the best path forward, let's step back and reassess before committing.
- Fix GeographyType.apply(srid) to use actual CRS instead of hardcoded default - Restore removed null assertions and comments in SpatialReferenceSystemMapperSuite - Add ESRI:102013 and EPSG:4612 tests for PROJ-sourced registry entries - Add more valid SRIDs (4267, 4269, 4612) across Geography test suites - Add more invalid SRIDs (999, 99999) to GeometryTypeSuite - Update JavaGeographyTypeSuite with more SRIDs and CRS values - Add Geography round-trip test in GeographyDataFrameSuite
| Assertions.assertEquals("OGC:CRS27", GeographicSpatialReferenceSystemMapper.getStringId(4267)); | ||
| Assertions.assertEquals("OGC:CRS83", GeographicSpatialReferenceSystemMapper.getStringId(4269)); | ||
| // GEOGRAPHY: PROJ-sourced registry entries (no override). | ||
| Assertions.assertEquals("EPSG:4612", GeographicSpatialReferenceSystemMapper.getStringId(4612)); |
There was a problem hiding this comment.
Let's also add one geographic ESRI entry, same as above
There was a problem hiding this comment.
Done. Added ESRI:37001 as a geographic ESRI entry for GeographicSpatialReferenceSystemMapper.
There was a problem hiding this comment.
Let's add a few new supported ones here, incl. ESPG and ESRI
There was a problem hiding this comment.
Done. Added "geometry(EPSG:3857)", "geometry(EPSG:32601)", and "geometry(ESRI:102013)" to the valid CRS JSON parsing test.
| test("createDataFrame and round-trip with Geometry SRIDs with overrides") { | ||
| // GeometryType supports 0, 3857, 4326, 4267, 4269 (OGC overrides). | ||
| val geom0 = Geometry.fromWKB(point1, 0) | ||
| val geom3857 = Geometry.fromWKB(point2, 3857) | ||
| val geom4326 = Geometry.fromWKB(point1, 4326) | ||
| val geom4269 = Geometry.fromWKB(point2, 4269) | ||
| val schema0 = StructType(Seq(StructField("g", GeometryType(0), nullable = false))) | ||
| val schema3857 = StructType(Seq(StructField("g", GeometryType(3857), nullable = false))) | ||
| val schema4326 = StructType(Seq(StructField("g", GeometryType(4326), nullable = false))) | ||
| val schema4269 = StructType(Seq(StructField("g", GeometryType(4269), nullable = false))) | ||
| checkAnswer( | ||
| spark.createDataFrame(sparkContext.parallelize(Seq(Row(geom0))), schema0), | ||
| Seq(Row(geom0))) | ||
| checkAnswer( | ||
| spark.createDataFrame(sparkContext.parallelize(Seq(Row(geom3857))), schema3857), | ||
| Seq(Row(geom3857))) | ||
| checkAnswer( | ||
| spark.createDataFrame(sparkContext.parallelize(Seq(Row(geom4326))), schema4326), | ||
| Seq(Row(geom4326))) | ||
| checkAnswer( | ||
| spark.createDataFrame(sparkContext.parallelize(Seq(Row(geom4269))), schema4269), | ||
| Seq(Row(geom4269))) | ||
| } |
There was a problem hiding this comment.
This test looks very similar to the one above, let's just merge them into a single test.
Also, there's a lot of code duplication - seems like we can just parameterize using forach.
There was a problem hiding this comment.
Done. Merged both tests into a single "createDataFrame and round-trip with Geometry SRIDs" test parameterized with foreach over Seq(0, 3857, 2000, 4326, 4267, 4269, 102100).
| val geog4267 = Geography.fromWKB(point1, 4267) | ||
| val geog4269 = Geography.fromWKB(point2, 4269) | ||
| val schema4267 = StructType(Seq(StructField("g", GeographyType(4267), nullable = false))) | ||
| val schema4269 = StructType(Seq(StructField("g", GeographyType(4269), nullable = false))) | ||
| checkAnswer( | ||
| spark.createDataFrame(sparkContext.parallelize(Seq(Row(geog4267))), schema4267), | ||
| Seq(Row(geog4267))) | ||
| checkAnswer( | ||
| spark.createDataFrame(sparkContext.parallelize(Seq(Row(geog4269))), schema4269), | ||
| Seq(Row(geog4269))) |
There was a problem hiding this comment.
There's a lot of code duplication - seems like we can just parameterize using forach.
There was a problem hiding this comment.
Done. Refactored to use foreach over Seq(4267, 4269), eliminating the per-SRID variable and schema duplication.
…, merge/deduplicate round-trip tests
| import scala.collection.immutable.Seq | ||
|
|
There was a problem hiding this comment.
Let's avoid unnecessary changes in this PR
There was a problem hiding this comment.
Done. Restored the import scala.collection.immutable.Seq.
uros-db
left a comment
There was a problem hiding this comment.
Thank you for these changes @szehon-ho, and for addressing the comments!
I left just one more comment, otherwise LGTM. Also, the CI failure looks unrelated.
|
@cloud-fan Please take a look. |
|
Test failure is not related (RocksDB) |
Yup, this should be good to go. @cloud-fan |
cloud-fan
left a comment
There was a problem hiding this comment.
Summary
Replaces the hardcoded 3-entry SRS registry with a CSV-based registry generated from PROJ (~10,642 entries), adds OGC standardization aliases (CRS84/CRS27/CRS83) and the Spark-specific SRID:0 on top, and fixes a latent bug in GeographyType.apply(srid) where the looked-up CRS was ignored in favor of the default.
Clean PR overall — two minor issues noted below.
|
thanks, merging to master! |
What changes were proposed in this pull request?
This is based on PR: #54571, which pre-compiles a list of SRID's from Proj.
Once that is in, we can activate these SRID's in the Geometry and Geography types
We also have some overrides to support OGC standard as well.
Note: some of these changes were from @uros-db initially in #54543, thanks!
Why are the changes needed?
Support standard SRID's for new Geo types
Does this PR introduce any user-facing change?
Geo types not released yet
How was this patch tested?
Add unit tests
Was this patch authored or co-authored using generative AI tooling?
Yes