The main problem here is that Nushell's find mechanics don't "play nice" with the nested table structure of a spreadsheet. What the find command receives from open "huge file.xlsx" is simply one row for each sheet (tab) in the Excel file. It happily scans that row, finds the text, and, yes, returns each sheet as one result set.
As for it coming back "as one (JSON ?) string": it's not quite JSON, nor even Nuon; I'm guessing it's just a side effect of find's internals when faced with the nested table.
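If you want to confirm the shape that find is being handed, describe will show it (a quick sketch; substitute your own filename, and the exact output will vary with the workbook):

open "huge file.xlsx" | describe
# something like: record<Sheet1: table<...>, Sheet2: table<...>, ...>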
There are several possible ways to get a better result back from find on an Excel file. None of these is highly efficient, however, since each involves scanning the file multiple times. You say the file is "huge", but size is relative, so I'm not quite sure what that means; I'm assuming here that the file isn't so large that it can't be loaded into memory. By loading it into memory before performing the following operations, performance seems to be on par with a "normal", single find.
Simple(?) solution
First, a general Nu solution for this that maintains the existing find functionality. For the sake of brevity in this answer, I'm running it on a smaller file 1 (Book1.xlsx, listed below), and my search term is foo:
let excelfile = (open "huge file.xlsx")

# Each top-level column of the data is a sheet (tab) in the workbook,
# so loop over the sheet names and run find against each sheet's table
$excelfile
| columns
| each {|sheet|
    let results = ($excelfile | get $sheet | enumerate | flatten | find foo)
    {
        Sheet: $sheet,
        Found: $results
    }
}
Results below in footnote 2.
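If you'd rather drop sheets with no matches from that output, a small variation on the same loop (just a sketch; is-empty and compact are standard Nushell commands) filters them out:

$excelfile
| columns
| each {|sheet|
    let results = ($excelfile | get $sheet | enumerate | flatten | find foo)
    # Return null for sheets without matches, then strip the nulls
    if ($results | is-empty) { null } else { { Sheet: $sheet, Found: $results } }
}
| compact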
Alternative
An alternative solution that returns results closer to the example in your question is a little more complicated. Essentially, we need to "unwrap" each nested layer and scan the end result separately in order to identify the:
- Filename
- Sheet
- Column
- Row
... in which it was found.
def xlsx-find [ fileList phrase ] {
    # Each nested each below wraps its results in another list, so the
    # final result needs one flatten per level of extra nesting (three)
    $fileList | each {|filename|
        # Open each workbook once and reuse the in-memory copy
        let t = (open $filename)
        let sheets = ($t | columns)
        $sheets | each {|sheetname|
            let columns = ($t | get $sheetname | columns)
            $columns | each {|columnname|
                # enumerate first so the original row numbers survive
                # after find removes the non-matching rows
                let findResults = ($t | get $sheetname | get $columnname | enumerate | find $phrase)
                # Keep only the row numbers of the matches
                $findResults | get index | each {|row|
                    {
                        file: $filename,
                        sheet: $sheetname,
                        column: $columnname,
                        row: $row
                    }
                }
            }
        }
    } | flatten | flatten | flatten
}
Explanation of key points that might be "tricky":
- Essentially, we're looping through each file provided to the function.
- Obtaining and looping through each sheet in that Excel file.
- Obtaining and looping through each column in that sheet.
- Using enumerate on that column, so that we know the original row number even after non-matching rows are removed.
- Using find to locate the text.
- Looping through each of the results, taking only the index (the row number provided by enumerate).
- Returning a record with the desired information.
- Every each statement creates its own list of results, so at the end, we need to flatten the results once per level of extra nesting (three times; see the sketch just below).
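To see why exactly three, here's a contrived sketch with the same nesting depth: the innermost each returns a table, and each of the three enclosing each loops wraps it in one more list; each flatten then strips one level:

# Four levels of nesting collapse into a flat table after three flattens
[[[[{x: 1}] [{x: 2}]]]] | flatten | flatten | flatten
# => rows {x: 1} and {x: 2}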
xlsx-find [ Book1.xlsx ] foo
The results are much more compact (and yet still a Nu table):
╭───┬────────────┬────────┬─────────┬─────╮
│ # │ file │ sheet │ column │ row │
├───┼────────────┼────────┼─────────┼─────┤
│ 0 │ Book1.xlsx │ Sheet2 │ column0 │ 0 │
│ 1 │ Book1.xlsx │ Sheet2 │ column1 │ 0 │
│ 2 │ Book1.xlsx │ Sheet2 │ column2 │ 0 │
│ 3 │ Book1.xlsx │ Sheet3 │ column0 │ 3 │
│ 4 │ Book1.xlsx │ Sheet3 │ column0 │ 5 │
│ 5 │ Book1.xlsx │ Sheet3 │ column1 │ 4 │
╰───┴────────────┴────────┴─────────┴─────╯
It is, of course, entirely possible to output text based on Nushell string interpolation of the above.
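For example, a minimal sketch (reusing xlsx-find from above; hit is just the name I've given the closure parameter):

xlsx-find [ Book1.xlsx ] foo
| each {|hit| $"foo found in ($hit.file), sheet ($hit.sheet), ($hit.column), row ($hit.row)" }
| str join (char newline)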
While I haven't gone this far, if you wanted to get some "context" (text before and after each found result), you'd probably use some form of str replace to trim the results.
Performance
I also tested this on a much larger file. I have ~15,000 Stack Overflow answers (including their Markdown) in an Excel file for studying potential ChatGPT/AI answers. I basically duplicated the 15k rows amongst several sheets and ended up with just over 100k rows in a 20MB file. I don't consider this "huge", but it's substantial. Coincidentally, it also contains a lot of answers with the phrase "todo" in them.
xlsx-find [ hugefile.xlsx ] todo
... took just over 15 seconds (timed with the timeit command) to return 945 results. This is on par with a normal find. However, when the file was read from disk on each scan (rather than being loaded into memory in advance), it took just over 1 minute.
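For reference, the timing simply wraps the call in a closure for timeit (a sketch):

timeit { xlsx-find [ hugefile.xlsx ] todo }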
Footnotes
1 Book1.xlsx:
Sheet1: Empty
Sheet2:
╭───┬─────────┬─────────┬─────────╮
│ # │ column0 │ column1 │ column2 │
├───┼─────────┼─────────┼─────────┤
│ 0 │ Foo1 │ Foo2 │ Foo3 │
╰───┴─────────┴─────────┴─────────╯
Sheet3:
╭───┬─────────┬──────────────────────────╮
│ # │ column0 │ column1 │
├───┼─────────┼──────────────────────────┤
│ 0 │ abc │ 456.00 │
│ 1 │ def │ StackOverflow │
│ 2 │ 123.00 │ SuperUser │
│ 3 │ Foo │ Not found here │
│ 4 │ Bar │ It is foound here though │
│ 5 │ foobar │ │
╰───┴─────────┴──────────────────────────╯
2 Results of the first command:
Note: all "foo" matches are highlighted by the find command, but the highlighting is not visible in Markdown:
╭───┬────────┬────────────────────────────────────────────╮
│ # │ Sheet │ Found │
├───┼────────┼────────────────────────────────────────────┤
│ 0 │ Sheet1 │ [list 0 items] │
│ 1 │ Sheet2 │ ╭───┬─────────┬─────────┬─────────╮ │
│ │ │ │ # │ column0 │ column1 │ column2 │ │
│ │ │ ├───┼─────────┼─────────┼─────────┤ │
│ │ │ │ 0 │ Foo1 │ Foo2 │ Foo3 │ │
│ │ │ ╰───┴─────────┴─────────┴─────────╯ │
│ 2 │ Sheet3 │ ╭───┬─────────┬──────────────────────────╮ │
│ │ │ │ # │ column0 │ column1 │ │
│ │ │ ├───┼─────────┼──────────────────────────┤ │
│ │ │ │ 3 │ Foo │ Not found here │ │
│ │ │ │ 4 │ Bar │ It is foound here though │ │
│ │ │ │ 5 │ foobar │ │ │
│ │ │ ╰───┴─────────┴──────────────────────────╯ │
╰───┴────────┴────────────────────────────────────────────╯