Multistep Sankey Graph from Dataframe

Question

I have a dataframe with the following structure

INDEX	ANO	DISTRITO	CONCELHO	NCCO
0	2013.0	Aveiro	Albergaria-a-Velha	98
1	2013.0	Aveiro	Albergaria-a-velha	1
2	2013.0	Aveiro	Anadia	41

The full dataset can be found here

This dataset ranges from 2013 to 2022 (ANO) and includes 18 different districts (DISTRITO), 278 different counties (CONCELHO), and the number of forest fires per CONCELHO (`NCCO`).

I'm able to produce a one-step Sankey graph with this code, which I adapted from here:

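import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

df = pd.read_csv('heatmap_full.csv')  # generated by ingestor.py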

all_nodes = df.ANO.values.tolist() + df.DISTRITO.values.tolist() 
source_indices = [all_nodes.index(ANO) for ANO in df.ANO]
target_indices = [all_nodes.index(DISTRITO) for DISTRITO in df.DISTRITO]

colors = px.colors.qualitative.D3
node_colors = [np.random.choice(colors) for node in all_nodes]

fig = go.Figure(data=[go.Sankey(
    # Define nodes
    node = dict(
        pad = 20,
        thickness = 20,
        line = dict(color = "black", width = 1.0),
        label = all_nodes,
        color = node_colors,
    ),

    # Add links
    link = dict(
        source = source_indices,
        target = target_indices,
        value = df.NCCO,
    ),
)])

fig.update_layout(title_text="FOREST FIRES IN PORTUGAL",
                height = 900,
                width=1200,
                font_size=18)
fig.show()

My Problem/Question

I would like to have a step after DISTRITO for CONCELHO appearing in the Sankey graph, but I can't figure it out.

Can I add a new trace to the figure? Do I need to treat my original dataset in another way?

Any help would be much appreciated

Disclosure: This is not meant for commercial use.

Not directly the question, but I doubt a Sankey diagram is the best representation for this type of data. A heatmap or grouped bars would probably be much more informative. — mozway, Feb 12 '22 at 13:25
@mozway appreciate the comment, but this will be part of a larger dashboard that allows users to visualise the datasets in several formats, including heatmaps. — Jorge Gomes, Feb 12 '22 at 13:46

score 1 · Answer 1 · answered Feb 12 '22 at 17:23

This reuses an earlier answer on building a Sankey diagram (plotly sankey graph data formatting).
Build a data frame of source and target values. Note two data cleanups:
1. There are duplicate CONCELHO values that differ only in capitalisation.
2. Some values appear in both CONCELHO and DISTRITO. These are modified so that the Sankey has no circular links.
As noted in the comments, there are really too many nodes to represent in a Sankey diagram, so the code below samples 100 rows.
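To make the two cleanups concrete, here is a minimal sketch in plain Python rather than pandas. The first two rows come from the question's sample; the third row (a CONCELHO named like its DISTRITO, with a value of 5) is invented for illustration, and `clean_concelho` is a hypothetical helper name.

```python
from collections import defaultdict


def clean_concelho(distrito, concelho):
    """Hypothetical helper mirroring the two cleanups described above."""
    # Cleanup 2: a CONCELHO named exactly like its DISTRITO gets a suffix,
    # so the link does not point back at its own source node.
    if concelho == distrito:
        concelho = concelho + "_c"
    # Cleanup 1: normalise case so differently-cased spellings merge.
    return concelho.capitalize()


rows = [
    ("Aveiro", "Albergaria-a-Velha", 98),
    ("Aveiro", "Albergaria-a-velha", 1),
    ("Aveiro", "Aveiro", 5),  # invented row, not from the dataset
]

totals = defaultdict(int)
for distrito, concelho, ncco in rows:
    totals[(distrito, clean_concelho(distrito, concelho))] += ncco

print(dict(totals))
# {('Aveiro', 'Albergaria-a-velha'): 99, ('Aveiro', 'Aveiro_c'): 5}
```

The two spellings of Albergaria-a-Velha collapse into a single target, and the invented Aveiro row no longer links a node to itself.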

import pandas as pd
import numpy as np
import plotly.graph_objects as go

df_in = pd.read_csv("https://raw.githubusercontent.com/vostpt/ICNF_DATA/main/heatmap_full.csv")

# too much data
df_in = df_in.sample(100)

# cleanup where same values exist in two columns
df_in["CONCELHO"] = np.where(df_in["DISTRITO"]==df_in["CONCELHO"], df_in["CONCELHO"]+"_c", df_in["CONCELHO"])
# normalise capitalisation so CONCELHO names that differ only in case become one node
df_in["CONCELHO"] = df_in["CONCELHO"].str.capitalize()
df = df_in.groupby(["ANO","DISTRITO"], as_index=False)["NCCO"].sum().rename(columns={"ANO":"source","DISTRITO":"target","NCCO":"value"})
df["source"] = df["source"].astype(int).astype(str)

df = pd.concat([df, df_in.groupby(["DISTRITO","CONCELHO"], as_index=False)["NCCO"].sum().rename(columns={"DISTRITO":"source","CONCELHO":"target", "NCCO":"value"})])

nodes = np.unique(df[["source","target"]], axis=None)
nodes = pd.Series(index=nodes, data=range(len(nodes)))

go.Figure(
    go.Sankey(
        node={"label": nodes.index},
        link={
            "source": nodes.loc[df["source"]],
            "target": nodes.loc[df["target"]],
            "value": df["value"],
        },
    )
)
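The same idea extends to any number of steps: each pair of adjacent columns contributes one set of links. The sketch below is one possible generalisation, not part of the original answer. `build_sankey_links` is a hypothetical helper written in plain Python over row dictionaries, and it keys every node by its level as well as its label, which is an alternative to renaming values that appear in two columns. The first two rows are taken from the question's sample (with ANO as an integer); the third is invented.

```python
from collections import defaultdict


def build_sankey_links(rows, levels, value_key):
    """Aggregate rows into Sankey links between each pair of adjacent levels."""
    totals = defaultdict(float)
    for row in rows:
        for depth in range(len(levels) - 1):
            # Key nodes by (depth, label) so that equal labels on different
            # levels remain separate nodes and cannot form a circular link.
            src = (depth, str(row[levels[depth]]))
            dst = (depth + 1, str(row[levels[depth + 1]]))
            totals[(src, dst)] += row[value_key]

    # Assign a stable integer index to every node.
    nodes = sorted({node for link in totals for node in link})
    index = {node: i for i, node in enumerate(nodes)}
    return {
        "labels": [label for _, label in nodes],
        "source": [index[src] for src, _ in totals],
        "target": [index[dst] for _, dst in totals],
        "value": list(totals.values()),
    }


rows = [
    {"ANO": 2013, "DISTRITO": "Aveiro", "CONCELHO": "Albergaria-a-Velha", "NCCO": 98},
    {"ANO": 2013, "DISTRITO": "Aveiro", "CONCELHO": "Anadia", "NCCO": 41},
    # Invented row: a CONCELHO that shares its name with the DISTRITO.
    {"ANO": 2013, "DISTRITO": "Aveiro", "CONCELHO": "Aveiro", "NCCO": 5},
]

sankey = build_sankey_links(rows, ["ANO", "DISTRITO", "CONCELHO"], "NCCO")
print(sankey["labels"])  # ['2013', 'Aveiro', 'Albergaria-a-Velha', 'Anadia', 'Aveiro']
print(sankey["source"])  # [0, 1, 1, 1]
print(sankey["target"])  # [1, 2, 3, 4]
print(sankey["value"])   # [144.0, 98.0, 41.0, 5.0]
```

The returned lists map onto `go.Sankey` in the same way as above: `node={"label": sankey["labels"]}`, and `link` built from `source`, `target` and `value`. With a dataframe, the rows could come from `df_in.to_dict("records")`.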
