11.4. Spatial RDD Providers

11.4.1. Accumulo RDD Provider

The AccumuloSpatialRDDProvider is a spatial RDD provider for Accumulo data stores. The core code is in the geomesa-accumulo-spark module, and the shaded JARs with dependencies are available in the geomesa-accumulo-spark-runtime-accumulo20 and geomesa-accumulo-spark-runtime-accumulo21 modules.

Note

The GeoMesa Spark runtime JARs are convenient bundles of all the required dependencies for each data store. There are two Accumulo Spark runtime JARs, one for Accumulo 2.0.x (geomesa-accumulo-spark-runtime-accumulo20) and one for Accumulo 2.1.x (geomesa-accumulo-spark-runtime-accumulo21). Make sure that you use the JAR corresponding to your Accumulo version.
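
As a small illustration of the version matching described in this note, the following sketch maps an Accumulo version string to the corresponding runtime module name. The AccumuloRuntimeModule object and its forVersion method are hypothetical and are not part of GeoMesa; only the two module names come from this section.

```scala
// Hypothetical helper (not part of GeoMesa): picks the Spark runtime module
// that matches a given Accumulo version, using the two modules named above.
object AccumuloRuntimeModule {
  def forVersion(accumuloVersion: String): Option[String] =
    // Compare only the major and minor components, e.g. "2.1.3" -> List("2", "1")
    accumuloVersion.split('.').take(2).toList match {
      case "2" :: "0" :: Nil => Some("geomesa-accumulo-spark-runtime-accumulo20")
      case "2" :: "1" :: Nil => Some("geomesa-accumulo-spark-runtime-accumulo21")
      case _                 => None // no runtime module is listed here for other versions
    }
}
```

Returning an Option rather than a default module makes an unsupported version an explicit case for the caller to handle.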

This provider can read from and write to a GeoMesa AccumuloDataStore. The configuration parameters are the same as those passed to DataStoreFinder.getDataStore(). See Accumulo Data Store Parameters for details.

The feature type to read is given by the type name of the Query passed to the rdd() method. For example, to load an RDD of features of type gdelt from the geomesa Accumulo catalog table:

import org.apache.hadoop.conf.Configuration
import org.geotools.api.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark

val params = Map(
  "accumulo.instance.name" -> "mycloud",
  "accumulo.user"          -> "user",
  "accumulo.password"      -> "password",
  "accumulo.zookeepers"    -> "zoo1,zoo2,zoo3",
  "accumulo.catalog"       -> "geomesa")
val query = new Query("gdelt")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

11.4.2. HBase RDD Provider

The HBaseSpatialRDDProvider is a spatial RDD provider for HBase data stores. The core code is in the geomesa-hbase-spark module, and the shaded JAR-with-dependencies (which contains all the required dependencies for execution) is available in the geomesa-hbase-spark-runtime-hbase2 module.

Note

The GeoMesa Spark runtime JARs are convenient bundles of all the required dependencies for each data store.

This provider can read from and write to a GeoMesa HBaseDataStore. The configuration parameters are the same as those passed to DataStoreFinder.getDataStore(). See HBase Data Store Parameters for details.

Note

Connecting to HBase generally requires the hbase-site.xml file to be available on the Spark classpath. This may be accomplished by specifying it with --jars. For example:

$ spark-shell --jars file:///opt/geomesa/dist/spark/geomesa-hbase-spark-runtime-hbase2_${VERSION}.jar,file:///usr/lib/hbase/conf/hbase-site.xml

Alternatively, you may specify the zookeepers in the data store parameter map. However, this may not work for every HBase setup.
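
The two connection options in this note can be sketched as a parameter-map builder. HBaseParams and its build method are hypothetical names used only for illustration; the keys hbase.catalog and hbase.zookeepers are the ones used in this section. When no zookeepers are given, the sketch assumes that hbase-site.xml is available on the Spark classpath.

```scala
// Hypothetical helper (not part of GeoMesa): builds the HBase data store
// parameter map, adding the zookeepers only when they are supplied.
object HBaseParams {
  def build(catalog: String, zookeepers: Seq[String] = Seq.empty): Map[String, String] = {
    val base = Map("hbase.catalog" -> catalog)
    if (zookeepers.isEmpty) base // assumption: hbase-site.xml on the classpath supplies the connection
    else base + ("hbase.zookeepers" -> zookeepers.mkString(","))
  }
}
```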

The feature type to read is given by the type name of the Query passed to the rdd() method. For example, to load an RDD of features of type gdelt from the geomesa HBase catalog table:

import org.apache.hadoop.conf.Configuration
import org.geotools.api.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark

val params = Map("hbase.zookeepers" -> "zoo1,zoo2,zoo3", "hbase.catalog" -> "geomesa")
val query = new Query("gdelt")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

11.4.3. FileSystem RDD Provider

The FileSystemRDDProvider is a spatial RDD provider for GeoMesa file system data stores. The core code is in the geomesa-fs-spark module, and the shaded JAR-with-dependencies (which contains all the required dependencies for execution) is available in the geomesa-fs-spark-runtime module.

This provider can read from and write to a GeoMesa FileSystemDataStore. The configuration parameters are the same as those passed to DataStoreFinder.getDataStore(). See FileSystem Data Store Parameters for details.

The feature type to read is given by the type name of the Query passed to the rdd() method. For example, to load an RDD of features of type gdelt from an S3 bucket:

import org.apache.hadoop.conf.Configuration
import org.geotools.api.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark

val params = Map("fs.path" -> "s3a://mybucket/geomesa/datastore")
val query = new Query("gdelt")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

See FileSystem Data Store Spark SQL Example for an example of using Spark SQL with the FileSystem data store.

11.4.4. Converter RDD Provider

The ConverterSpatialRDDProvider is provided by the geomesa-spark-converter module.

ConverterSpatialRDDProvider reads features from one or more data files in formats readable by the GeoMesa Converters library, including delimited and fixed-width text, Avro, JSON, and XML files. It takes the following configuration parameters:

  • geomesa.converter - the converter definition as a Typesafe Config string

  • geomesa.converter.inputs - input file paths, comma-delimited

  • geomesa.sft - the SimpleFeatureType, as a spec string, configuration string, or environment lookup name

  • geomesa.sft.name - (optional) the name of the SimpleFeatureType
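
A minimal sketch of assembling these four parameters follows. ConverterParams and its build method are hypothetical names, not part of GeoMesa; the sketch only shows that several input paths are joined with commas and that geomesa.sft.name is left out when no name is given.

```scala
// Hypothetical helper (not part of GeoMesa): assembles the converter
// provider's parameter map from the four keys listed above.
object ConverterParams {
  def build(
      converterConf: String,
      inputs: Seq[String],
      sft: String,
      sftName: Option[String] = None): Map[String, String] = {
    require(inputs.nonEmpty, "at least one input path is required")
    val base = Map(
      "geomesa.converter"        -> converterConf,
      "geomesa.converter.inputs" -> inputs.mkString(","), // comma-delimited paths
      "geomesa.sft"              -> sft)
    // geomesa.sft.name is optional, so add it only when a name is supplied
    sftName.fold(base)(name => base + ("geomesa.sft.name" -> name))
  }
}
```

The resulting map can be passed wherever a parameter map is expected, in the same way as a hand-written one.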

Consider the example data described in the Example Usage section of the GeoMesa Converters documentation. If the file example.csv contains the example data, and example.conf contains the Typesafe configuration file for the converter, the following Scala code can be used to load this data into an RDD:

import com.typesafe.config.ConfigFactory
import org.apache.hadoop.conf.Configuration
import org.geotools.api.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark

val exampleConf = ConfigFactory.load("example.conf").root().render()
val params = Map(
  "geomesa.converter"        -> exampleConf,
  "geomesa.converter.inputs" -> "example.csv",
  "geomesa.sft"              -> "phrase:String,dtg:Date,geom:Point:srid=4326",
  "geomesa.sft.name"         -> "example")
val query = new Query("example")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

It is also possible to load the prepackaged converters for public data sources (GDELT, GeoNames, etc.) via Maven or SBT. See Prepackaged Converter Definitions for more details.

Warning

ConverterSpatialRDDProvider is read-only and does not support writing features to data files.

11.4.5. GeoTools RDD Provider

GeoToolsSpatialRDDProvider is provided by the geomesa-gt-spark module.

GeoToolsSpatialRDDProvider generates and saves RDDs of features stored in a generic GeoTools DataStore. The configuration parameters are the same as those passed to DataStoreFinder.getDataStore() to create the data store of interest, plus a required boolean parameter called “geotools” that tells the SPI to load GeoToolsSpatialRDDProvider. For example, to use the PostGIS DataStore with GeoMesa Spark, do the following:

import org.apache.hadoop.conf.Configuration
import org.geotools.api.data.Query
import org.locationtech.geomesa.spark.GeoMesaSpark

val params = Map(
  "geotools" -> "true",
  "dbtype"   -> "postgis",
  "host"     -> "localhost",
  "user"     -> "postgres",
  "passwd"   -> "postgres",
  "port"     -> "5432",
  "database" -> "example")
val query = new Query("locations")
val rdd = GeoMesaSpark(params).rdd(new Configuration(), sc, params, query)

The name of the feature type to access in the data store is passed as the type name of the query passed to the rdd() method. In the example above, this is “locations”.

Warning

Do not use the GeoTools RDD provider with a GeoMesa data store that has its own provider implementation. The dedicated providers described above include additional optimizations that improve read and write performance.
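
One way to guard against this mistake is to check the parameter map before adding the “geotools” flag. The sketch below is a hypothetical helper, not part of GeoMesa. The set of keys it checks is an assumption taken from the examples in this section (accumulo.catalog, hbase.catalog and fs.path) and is not an exhaustive list.

```scala
// Hypothetical helper (not part of GeoMesa): adds the "geotools" flag unless
// the parameters look like they belong to a store with a dedicated provider.
object GeoToolsParams {
  // Assumption: keys taken from the examples in this section; not exhaustive.
  private val dedicatedProviderKeys = Set("accumulo.catalog", "hbase.catalog", "fs.path")

  def forGeoTools(storeParams: Map[String, String]): Either[String, Map[String, String]] = {
    val clash = storeParams.keySet.intersect(dedicatedProviderKeys)
    if (clash.nonEmpty)
      Left(s"Use the dedicated provider for: ${clash.toSeq.sorted.mkString(", ")}")
    else
      Right(storeParams + ("geotools" -> "true"))
  }
}
```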