The main problem here is that Nushell's find mechanics don't "play nice" with the nested table structure of a spreadsheet. What the find command receives from open "huge file.xlsx" is simply one row for each sheet (tab) in the Excel file. It happily scans that row, finds the text, and, yes, returns each sheet as one result set.
As for it coming back "as one (JSON ?) string": it's not quite JSON, nor even Nuon; I'm guessing it's just a side effect of find's internals when faced with the nested table.
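If you want to confirm the shape that find is being handed, describe will show it (a quick sketch; substitute your own filename, and the exact output will vary with the workbook):

open "huge file.xlsx" | describe
# something like: record<Sheet1: table<...>, Sheet2: table<...>, ...>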
There are several possible ways to get a better result back from find on an Excel file. None of these is highly efficient, however, since each involves scanning the file multiple times. You say the file is "huge", but size is relative, so I'm not quite sure what that means; I'm assuming here that the file isn't so large that it can't be loaded into memory. By loading it into memory before performing the following operations, performance seems to be on par with a "normal", single find.
Simple(?) solution
First, a general Nu solution for this that maintains the existing find functionality. For the sake of brevity in this answer, I'm running it on a smaller file 1 (Book1.xlsx, listed below), and my search term is foo:
let excelfile = (open "huge file.xlsx")

# Each top-level column of the data is a sheet (tab) in the workbook,
# so loop over the sheet names and run find against each sheet's table
$excelfile
| columns
| each {|sheet|
    let results = ($excelfile | get $sheet | enumerate | flatten | find foo)
    {
        Sheet: $sheet,
        Found: $results
    }
}
Results below in footnote 2.
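If you'd rather drop sheets with no matches from that output, a small variation on the same loop (just a sketch; is-empty and compact are standard Nushell commands) filters them out:

$excelfile
| columns
| each {|sheet|
    let results = ($excelfile | get $sheet | enumerate | flatten | find foo)
    # Return null for sheets without matches, then strip the nulls
    if ($results | is-empty) { null } else { { Sheet: $sheet, Found: $results } }
}
| compact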
Alternative
An alternative solution that returns results closer to the example in your question is a little more complicated. Essentially, we need to "unwrap" each nested layer and scan the end result separately in order to identify the:
- Filename
- Sheet
- Column
- Row
... in which it was found.
def xlsx-find [ fileList phrase ] {
    # Each nested each below wraps its results in another list, so the
    # final result needs one flatten per level of extra nesting (three)
    $fileList | each {|filename|
        # Open each workbook once and reuse the in-memory copy
        let t = (open $filename)
        let sheets = ($t | columns)
        $sheets | each {|sheetname|
            let columns = ($t | get $sheetname | columns)
            $columns | each {|columnname|
                # enumerate first so the original row numbers survive
                # after find removes the non-matching rows
                let findResults = ($t | get $sheetname | get $columnname | enumerate | find $phrase)
                # Keep only the row numbers of the matches
                $findResults | get index | each {|row|
                    {
                        file: $filename,
                        sheet: $sheetname,
                        column: $columnname,
                        row: $row
                    }
                }
            }
        }
    } | flatten | flatten | flatten
}
Explanation of key points that might be "tricky":
- Essentially, we're looping through each file provided to the function.
- Obtaining and looping through each sheet in that Excel file.
- Obtaining and looping through each column in that sheet.
- Using enumerate on that column, so that we know the original row number even after non-matching rows are removed.
- Using find to locate the text.
- Looping through each of the results, taking only the index (the row number provided by enumerate).
- Returning a record with the desired information.
- Every each statement creates its own list of results, so at the end, we need to flatten the results once per level of extra nesting (three times; see the sketch just below).
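To see why exactly three, here's a contrived sketch with the same nesting depth: the innermost each returns a table, and each of the three enclosing each loops wraps it in one more list; each flatten then strips one level:

# Four levels of nesting collapse into a flat table after three flattens
[[[[{x: 1}] [{x: 2}]]]] | flatten | flatten | flatten
# => rows {x: 1} and {x: 2}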
xlsx-find [ Book1.xlsx ] foo
The results are much more compact (and yet still a Nu table):
╭───┬────────────┬────────┬─────────┬─────╮
│ # │ file │ sheet │ column │ row │
├───┼────────────┼────────┼─────────┼─────┤
│ 0 │ Book1.xlsx │ Sheet2 │ column0 │ 0 │
│ 1 │ Book1.xlsx │ Sheet2 │ column1 │ 0 │
│ 2 │ Book1.xlsx │ Sheet2 │ column2 │ 0 │
│ 3 │ Book1.xlsx │ Sheet3 │ column0 │ 3 │
│ 4 │ Book1.xlsx │ Sheet3 │ column0 │ 5 │
│ 5 │ Book1.xlsx │ Sheet3 │ column1 │ 4 │
╰───┴────────────┴────────┴─────────┴─────╯
It is, of course, entirely possible to output text based on Nushell string interpolation of the above.
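For example, a minimal sketch (reusing xlsx-find from above; hit is just the name I've given the closure parameter):

xlsx-find [ Book1.xlsx ] foo
| each {|hit| $"foo found in ($hit.file), sheet ($hit.sheet), ($hit.column), row ($hit.row)" }
| str join (char newline)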
While I haven't gone this far, if you wanted to get some "context" (text before and after each found result), you'd probably use some form of str replace to trim the results.
Performance
I also tested this on a much larger file. I have ~15,000 Stack Overflow answers (including their Markdown) in an Excel file for studying potential ChatGPT/AI answers. I basically duplicated the 15k rows amongst several sheets and ended up with just over 100k rows in a 20MB file. I don't consider this "huge", but it's substantial. Coincidentally, it also contains a lot of answers with the phrase "todo" in them.
xlsx-find [ hugefile.xlsx ] todo
... took just over 15 seconds (timed with the timeit command) to return 945 results. This is on par with a normal find. However, when the file was read from disk on each scan (rather than being loaded into memory in advance), it took just over 1 minute.
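For reference, the timing simply wraps the call in a closure for timeit (a sketch):

timeit { xlsx-find [ hugefile.xlsx ] todo }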
Footnotes
1 Book1.xlsx:
Sheet1: Empty
Sheet2:
╭───┬─────────┬─────────┬─────────╮
│ # │ column0 │ column1 │ column2 │
├───┼─────────┼─────────┼─────────┤
│ 0 │ Foo1 │ Foo2 │ Foo3 │
╰───┴─────────┴─────────┴─────────╯
Sheet3:
╭───┬─────────┬──────────────────────────╮
│ # │ column0 │ column1 │
├───┼─────────┼──────────────────────────┤
│ 0 │ abc │ 456.00 │
│ 1 │ def │ StackOverflow │
│ 2 │ 123.00 │ SuperUser │
│ 3 │ Foo │ Not found here │
│ 4 │ Bar │ It is foound here though │
│ 5 │ foobar │ │
╰───┴─────────┴──────────────────────────╯
2 Results of the first command:
Note: all "foo" matches are highlighted by the find command, but the highlighting is not visible in Markdown:
╭───┬────────┬────────────────────────────────────────────╮
│ # │ Sheet │ Found │
├───┼────────┼────────────────────────────────────────────┤
│ 0 │ Sheet1 │ [list 0 items] │
│ 1 │ Sheet2 │ ╭───┬─────────┬─────────┬─────────╮ │
│ │ │ │ # │ column0 │ column1 │ column2 │ │
│ │ │ ├───┼─────────┼─────────┼─────────┤ │
│ │ │ │ 0 │ Foo1 │ Foo2 │ Foo3 │ │
│ │ │ ╰───┴─────────┴─────────┴─────────╯ │
│ 2 │ Sheet3 │ ╭───┬─────────┬──────────────────────────╮ │
│ │ │ │ # │ column0 │ column1 │ │
│ │ │ ├───┼─────────┼──────────────────────────┤ │
│ │ │ │ 3 │ Foo │ Not found here │ │
│ │ │ │ 4 │ Bar │ It is foound here though │ │
│ │ │ │ 5 │ foobar │ │ │
│ │ │ ╰───┴─────────┴──────────────────────────╯ │
╰───┴────────┴────────────────────────────────────────────╯