How to parse a JSON column in Spark
Suppose you have a table with a string column that contains JSON-encoded data, like the one below, and you want to parse and query the data in that column.
// In spark-shell the implicits are already in scope; otherwise, given a SparkSession named spark:
import spark.implicits._

val df = Seq(
  "{\"id\":1,\"amount\":{\"value\":60,\"currency\":\"BRL\"}}",
  "{\"id\":2,\"amount\":{\"value\":70,\"currency\":\"USD\"}}"
).toDF("my_json_col")
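Before parsing, it can help to sanity-check that the column really is just a plain string at this point. Printing the DataFrame should give something roughly like this:

df.show(false)
// +-----------------------------------------------+
// |my_json_col                                    |
// +-----------------------------------------------+
// |{"id":1,"amount":{"value":60,"currency":"BRL"}}|
// |{"id":2,"amount":{"value":70,"currency":"USD"}}|
// +-----------------------------------------------+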
The first step is to define the JSON schema (I will use Scala code):
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("amount", StructType(Seq(
    StructField("value", IntegerType),
    StructField("currency", StringType)
  )))
))
Notice that our JSON has nested keys: "value" and "currency" are inside the "amount" key.
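As a side note, if you find the nested StructType verbose, recent Spark versions can also build the same schema from a DDL-formatted string. This is just an alternative sketch (the ddlSchema name is mine); the StructType above is all you strictly need:

import org.apache.spark.sql.types.StructType

// Same schema, expressed as a DDL string (variable name is illustrative)
val ddlSchema = StructType.fromDDL("id INT, amount STRUCT<value: INT, currency: STRING>")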
With the schema defined, we can now parse the JSON using the from_json function. This turns the JSON string into a struct column whose fields and types match the schema.
import org.apache.spark.sql.functions.{col, from_json}

val parsedDf = df.withColumn("parsed", from_json(col("my_json_col"), schema))
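To confirm what from_json produced, you can inspect the schema of the new DataFrame; the output should look roughly like this:

parsedDf.printSchema()
// root
//  |-- my_json_col: string (nullable = true)
//  |-- parsed: struct (nullable = true)
//  |    |-- id: integer (nullable = true)
//  |    |-- amount: struct (nullable = true)
//  |    |    |-- value: integer (nullable = true)
//  |    |    |-- currency: string (nullable = true)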
Now it is possible to query any field of our DataFrame. You access a field with dot notation, <colname>.<key>; for nested keys, just keep adding key names after each dot. Examples:
parsedDf.filter(col("parsed.id") === 1)
parsedDf.filter(col("parsed.amount.value") > 50)
And what if I want to create a column for each field?
You can also use this dot notation with select; this way, a new column is created for each field. Note that you can also use the wildcard * to list all children of a given key.
parsedDf.select("parsed.id", "parsed.amount.*")
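With the sample data above, that select should produce one column per field, something like:

parsedDf.select("parsed.id", "parsed.amount.*").show()
// +--+-----+--------+
// |id|value|currency|
// +--+-----+--------+
// | 1|   60|     BRL|
// | 2|   70|     USD|
// +--+-----+--------+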
Summary
- Define the JSON schema;
- Use the from_json function to parse the JSON string into a struct column;
- Query fields using dot notation <colname>.<key>;
- Use select to create a column for each field.