How to parse a JSON column in Spark

Marcelo Li Koga
2 min readFeb 19, 2021

--

Sample of json text

Suppose you have a table with a string column that has JSON data encoded in it, like this one below, and you want to parse and query on the data of this column.

val df = Seq(
("{\"id\":1,\"amount\":{\"value\":60,\"currency\":\"BRL\"}}"),
("{\"id\":2,\"amount\":{\"value\":70,\"currency\":\"USD\"}}")
).toDF("my_json_col")
Dataframe with a string column that has JSON data

First step is to define the json schema (I will use Scala code):

import org.apache.spark.sql.types._val schema = StructType(Seq(
StructField("id", IntegerType),
StructField("amount", StructType(Seq(
StructField("value", IntegerType),
StructField("currency", StringType))))
))

Notice that our json has nested keys; “value” and “currency” are inside “amount” key.

With the schema, now we need to parse the json, using the from_json function. This will turn the json string into a Map object, mapping every key to its value.

val parsedDf = df.withColumn(“parsed”, from_json(col(“my_json_col”), schema))

Now, it is possible to query any field of our DataFrame. You access the fields by doing a dot notation <colname>.<key> ; if it’s a nested key, just keep adding the key names after each dot. Examples:

parsedDf.filter(col("parsed.id") === 1)
parsedDf.filter(col("parsed.amount.value") > 50)

And if I want to create a column for each field?

You can also use this dot notation with select ; this way, new columns will be created for each field. Note that here you can also use wildcard * for listing all children of a given key.

parsedDf.select("parsed.id", "parsed.amount.*")

Summary

  1. Define json schema;
  2. Use from_json function to parse json string into a Map;
  3. Query fields using dot notation <colname>.<key>;
  4. Use select to create columns for each field.

--

--

Marcelo Li Koga
Marcelo Li Koga

Written by Marcelo Li Koga

Software Engineer and Data Scientist

No responses yet