Are Apache Sedona geometry functions compatible with Spark Connect?
#1764 opened on Jan 18, 2025
Description
Hello.
We are trying to use geometry data with Apache Spark Connect and Apache Sedona. We can convert binary geometry data to Sedona geometry types using ST_GeomFromWKB on a local Apache Sedona instance, but when we attempt the same through a remote Spark Connect server, the ST_GeomFromWKB function cannot be resolved (see the error below). Are Sedona operations compatible with a Spark Connect server?
pyspark.errors.exceptions.connect.AnalysisException: [UNRESOLVED_ROUTINE] Cannot resolve function `ST_GeomFromWKB` on search path [`system`.`builtin`, `system`.`session`, `spark_catalog`.`default`].; line 1 pos 0
Actual behavior
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from sedona.spark import *
spark = SparkSession.builder.remote("sc://<spark_connect_address>:<port>").getOrCreate()
url = "jdbc:postgresql://<database_address>"
sedona = SedonaContext.create(spark)
df = (
    sedona.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "nyc_neighborhoods")
    .load()
    .withColumn("geom", f.expr("ST_GeomFromWKB(geom)"))
)
df.show()
Running this code produces the above error at df.show(). When we use Sedona Spark in conjunction with our Spark Connect server without geospatial data (i.e., we don't use .withColumn("geom", f.expr("ST_GeomFromWKB(geom)"))), there is no error; the data is loaded and made available with the geom column in the original binary form.
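To narrow this down, one check that may help is asking the remote session whether the Sedona functions are registered at all. The sketch below is a minimal helper, assuming the Spark Connect client exposes `spark.catalog.functionExists` (as PySpark 3.5 does) and that functions registered by Sedona are visible through it. `missing_functions` is a hypothetical name used only for illustration.

```python
from typing import Iterable, List


def missing_functions(spark, names: Iterable[str]) -> List[str]:
    """Return the names from `names` that the session's catalog cannot find.

    `spark` is any object exposing `catalog.functionExists(name)`, such as a
    PySpark SparkSession (local or Spark Connect).
    """
    return [name for name in names if not spark.catalog.functionExists(name)]


# Usage against the remote session from the snippet above (not run here):
#   missing_functions(spark, ["ST_GeomFromWKB", "ST_AsText"])
# A non-empty result would suggest the functions were never registered on the
# server side, which would be consistent with the UNRESOLVED_ROUTINE error.
```

Running the same check against a local Sedona session and against the remote one would show whether the difference lies in function registration rather than in the JDBC read.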
Note: We are using the PostGIS demo database found here.
Steps to reproduce the problem
- Start the Spark Connect server:
./sbin/start-connect-server.sh --packages org.apache.spark:spark-connect_2.12:3.5.0,org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.7.0,org.datasyslab:geotools-wrapper:1.7.0-28.5,org.postgresql:postgresql:42.7.4 --repositories https://artifacts.unidata.ucar.edu/repository/unidata-all --executor-memory 28G
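The launch command above can also be assembled programmatically, which makes it easier to toggle options while debugging. The sketch below builds the argument list for `start-connect-server.sh` from the package coordinates listed above. The optional `spark.sql.extensions=org.apache.sedona.sql.SedonaSqlExtensions` setting is an assumption, based on how Sedona documents SQL function registration for regular Spark sessions; whether it resolves the error under Spark Connect is untested here. `build_connect_command` is a hypothetical helper name.

```python
from typing import List, Optional

# Package coordinates and repository copied from the launch command above.
PACKAGES = [
    "org.apache.spark:spark-connect_2.12:3.5.0",
    "org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.7.0",
    "org.datasyslab:geotools-wrapper:1.7.0-28.5",
    "org.postgresql:postgresql:42.7.4",
]
REPOSITORY = "https://artifacts.unidata.ucar.edu/repository/unidata-all"


def build_connect_command(sql_extensions: Optional[str] = None) -> List[str]:
    """Build the argument list for starting the Spark Connect server.

    If `sql_extensions` is given, it is passed as `spark.sql.extensions`
    (assumption: this is how Sedona functions would be registered server-side).
    """
    cmd = [
        "./sbin/start-connect-server.sh",
        "--packages", ",".join(PACKAGES),
        "--repositories", REPOSITORY,
        "--executor-memory", "28G",
    ]
    if sql_extensions:
        cmd += ["--conf", f"spark.sql.extensions={sql_extensions}"]
    return cmd


if __name__ == "__main__":
    # Print the variant with the Sedona SQL extension enabled.
    print(" ".join(build_connect_command("org.apache.sedona.sql.SedonaSqlExtensions")))
```

Calling `build_connect_command()` with no argument reproduces the original command, so the two variants can be compared directly.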
Settings
Sedona version = 1.7.0
Apache Spark version = 3.5.0
Scala version = 2.12
Python version = 3.8