How to parse a JSON column in Spark

Marcelo Li Koga
2 min read · Feb 19, 2021

Suppose you have a table with a string column that contains JSON-encoded data, like the one below, and you want to parse and query the data in that column.

import spark.implicits._ // needed for .toDF on a local Seq

val df = Seq(
  """{"id":1,"amount":{"value":60,"currency":"BRL"}}""",
  """{"id":2,"amount":{"value":70,"currency":"USD"}}"""
).toDF("my_json_col")
This gives a DataFrame with a single string column containing the JSON data.
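
Calling show(false) on it should print something like this (illustrative output):

df.show(false)
// +-----------------------------------------------+
// |my_json_col                                    |
// +-----------------------------------------------+
// |{"id":1,"amount":{"value":60,"currency":"BRL"}}|
// |{"id":2,"amount":{"value":70,"currency":"USD"}}|
// +-----------------------------------------------+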

The first step is to define the JSON schema (I will use Scala code):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("amount", StructType(Seq(
    StructField("value", IntegerType),
    StructField("currency", StringType)
  )))
))

Notice that our JSON has nested keys: "value" and "currency" live inside the "amount" key.

With the schema defined, we can parse the JSON using the from_json function. This turns the JSON string into a struct column, with one field per key (and nested structs for nested keys).

import org.apache.spark.sql.functions.{col, from_json}

val parsedDf = df.withColumn("parsed", from_json(col("my_json_col"), schema))
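
To confirm, printSchema should now show parsed as a struct column next to the original string (illustrative output):

parsedDf.printSchema()
// root
//  |-- my_json_col: string (nullable = true)
//  |-- parsed: struct (nullable = true)
//  |    |-- id: integer (nullable = true)
//  |    |-- amount: struct (nullable = true)
//  |    |    |-- value: integer (nullable = true)
//  |    |    |-- currency: string (nullable = true)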

Now it is possible to query any field of our DataFrame. You access fields with dot notation, <colname>.<key>; if a key is nested, just keep adding key names after each dot. Examples:

parsedDf.filter(col("parsed.id") === 1)
parsedDf.filter(col("parsed.amount.value") > 50)

And what if I want a column for each field?

You can also use this dot notation with select; this way, a new column is created for each field. Note that you can also use the wildcard * to list all children of a given key.

parsedDf.select("parsed.id", "parsed.amount.*")
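
For the two sample rows above, this select should yield three flat columns, roughly:

parsedDf.select("parsed.id", "parsed.amount.*").show()
// +---+-----+--------+
// | id|value|currency|
// +---+-----+--------+
// |  1|   60|     BRL|
// |  2|   70|     USD|
// +---+-----+--------+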

Summary

  1. Define the JSON schema;
  2. Use the from_json function to parse the JSON string into a struct column;
  3. Query fields using dot notation <colname>.<key>;
  4. Use select to create a column for each field.
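
Putting it all together, here is a minimal, self-contained sketch of the whole flow (the object name and app name are illustrative; it assumes a local SparkSession):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

object ParseJsonColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parse-json-column") // illustrative app name
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // 1. Sample data: one JSON document per row, stored as a plain string
    val df = Seq(
      """{"id":1,"amount":{"value":60,"currency":"BRL"}}""",
      """{"id":2,"amount":{"value":70,"currency":"USD"}}"""
    ).toDF("my_json_col")

    // 2. Schema matching the JSON structure, including the nested "amount"
    val schema = StructType(Seq(
      StructField("id", IntegerType),
      StructField("amount", StructType(Seq(
        StructField("value", IntegerType),
        StructField("currency", StringType)
      )))
    ))

    // 3. Parse the string into a struct column
    val parsedDf = df.withColumn("parsed", from_json(col("my_json_col"), schema))

    // 4. Query with dot notation and flatten with select
    parsedDf.filter(col("parsed.amount.value") > 50).show(false)
    parsedDf.select("parsed.id", "parsed.amount.*").show()

    spark.stop()
  }
}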
