Following the breadcrumbs, I cobbled together some code that seems to do what I want: run in the background, look at ongoing jobs, and then collect whatever information may be available:
import threading
import time

import pyspark


def do_background_monitoring(sc: pyspark.context.SparkContext):
    # daemon=True so the monitoring loop doesn't keep the process alive at shutdown
    thread = threading.Thread(target=monitor, args=(sc,), daemon=True)
    thread.start()
    return thread


def monitor(sc: pyspark.context.SparkContext):
    job_tracker: pyspark.status.StatusTracker = sc.statusTracker()  # should this go inside the loop?
    while True:
        time.sleep(1)
        for job_id in job_tracker.getActiveJobsIds():
            job: pyspark.status.SparkJobInfo = job_tracker.getJobInfo(job_id)
            stages = job.stageIds
            # ???
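For context, I kick this off right after creating the SparkContext, something like this (the app name is just a placeholder):

sc = pyspark.SparkContext(appName="my-app")  # placeholder app name
monitor_thread = do_background_monitoring(sc)
# ... submit jobs as usual; the monitor polls the active jobs once a second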
However, that # ??? is where I hit a dead end. According to the docs, stageIds is an int[], and apparently py4j doesn't know what to do with it (even though the py4j docs claim otherwise...). All I get back is an opaque JavaObject:
ipdb> stages
JavaObject id=o34
ipdb> stages.<TAB>
          equals    notify    wait
          getClass  notifyAll
          hashCode  toString
ipdb> stages.toString()
'[I@4b1009f3'
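For what it's worth, this is roughly what I was hoping to do with those stage IDs once I had them as plain Python ints (getStageInfo and the SparkStageInfo fields below are the ones listed in the pyspark.status docs):

# inside the loop above, in place of the # ???
for stage_id in stages:
    stage: pyspark.status.SparkStageInfo = job_tracker.getStageInfo(stage_id)
    if stage:  # getStageInfo returns None if the stage info isn't available
        print(stage.name, stage.numCompletedTasks, "/", stage.numTasks)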
Is this a dead end? Are there other ways to achieve this? If I were willing to write Scala, could I write just this one bit in Scala and keep the rest in Python?
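To make that last question concrete, I'm picturing a tiny Scala helper object (MonitorHelpers and its activeStageIds method are names I made up, not an existing class) that hands the stage IDs back in a form Python can use, and calling it from the loop via py4j, roughly like:

# hypothetical: MonitorHelpers would be a small Scala object I write myself,
# returning the active stage IDs for a job as a java.util.List
java_ids = sc._jvm.com.example.MonitorHelpers.activeStageIds(sc._jsc.sc(), job_id)
stage_ids = [int(i) for i in java_ids]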
