Spark Sink in Scala
The Spark Sink in Scala plugin is available in the Hub.
Executes user-provided Spark code in Scala that operates on an input RDD or DataFrame, with full access to all Spark features.
Use this plugin when you want complete control over the Spark computation. For example, you may want to join the input RDD with another Dataset and select a subset of the join result using Spark SQL before writing the results out to files in Parquet format.
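As a sketch of that flow, the snippet below uses the DataFrame form of the sink function described under Configuration. The reference dataset path, output path, view names, and column names are illustrative assumptions, not values defined by the plugin, and the snippet relies on the classes that the plugin imports automatically.

```
// Hypothetical sketch: the paths, view names, and columns below are illustrative only.
def sink(df: DataFrame, context: SparkExecutionPluginContext) : Unit = {
  val spark = df.sparkSession

  // Reference dataset to join against (illustrative location).
  val customers = spark.read.parquet("/data/reference/customers")

  // Register both datasets so they can be queried with Spark SQL.
  df.createOrReplaceTempView("events")
  customers.createOrReplaceTempView("customers")

  // Keep only a subset of the join result.
  val joined = spark.sql(
    """SELECT e.event_id, e.amount, c.name, c.region
      |FROM events e
      |JOIN customers c ON e.customer_id = c.id
      |WHERE c.region = 'EMEA'""".stripMargin)

  // Write the selected columns out as Parquet files.
  joined.write.mode("overwrite").parquet("/data/output/joined_events")
}
```

Registering both datasets as temporary views keeps the join and the filtering in plain Spark SQL, which is often easier to review than a long chain of DataFrame operations.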
Configuration
Property | Macro Enabled? | Description
---|---|---
Scala | Yes | Required. Spark code in Scala defining how the input RDD or DataFrame is written out. The code must implement a function called sink; see the signatures and example below the table.
Dependencies | Yes | Optional. Extra dependencies for the Spark program, given as a comma-separated list of URIs pointing to the dependency jars. A path can end with an asterisk (*) as a wildcard, in which case all files with the .jar extension under the parent path will be included.
Compile at Deployment Time | No | Optional. Specify whether the code is validated when the pipeline is deployed. Defaults to true.

The Scala code must implement a function called sink. When operating on a DataFrame, the signature is one of:

```
def sink(df: DataFrame) : Unit
def sink(df: DataFrame, context: SparkExecutionPluginContext) : Unit
```

The input DataFrame contains the records emitted by the previous stage. When operating on the lower-level RDD, the signature is one of:

```
def sink(rdd: RDD[StructuredRecord]) : Unit
def sink(rdd: RDD[StructuredRecord], context: SparkExecutionPluginContext) : Unit
```

For example:

```
def sink(rdd: RDD[StructuredRecord], context: SparkExecutionPluginContext) : Unit = {
  // The output schema is available from the context (not used in this example).
  val outputSchema = context.getOutputSchema
  rdd
    .flatMap(_.get[String]("body").split("\\s+"))  // split the 'body' field into words
    .map(s => (s, 1))                              // pair each word with a count of 1
    .reduceByKey(_ + _)                            // sum the counts for each word
    .saveAsTextFile("output")                      // write the counts as text files
}
```

This performs a word count on the input field 'body' and writes the results as text files to the 'output' directory. Classes used in the signatures above, such as RDD, DataFrame, StructuredRecord, and SparkExecutionPluginContext, are imported automatically and are ready for the user code to use.
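For the DataFrame form of the signature, a minimal sketch can be as simple as writing the incoming records out as Parquet; the output path here is an illustrative assumption, not a plugin default.

```
// Minimal DataFrame-based sink: write the incoming records as Parquet.
def sink(df: DataFrame) : Unit = {
  df.write
    .mode("append")                    // append to any existing output
    .parquet("/data/output/records")   // illustrative destination path
}
```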
Created in 2020 by Google Inc.