Identify what slows down the compilation of a lexer

Question

I have a lexer and parser, built with sedlex and menhir in OCaml, to parse spreadsheet formulas.

The following part of the lexer defines regular expressions for the path+workbook+worksheet part before a reference. For instance, 'C:\Users\Pictures\[Book1.xlsx]Sheet1'! of ='C:\Users\Pictures\[Book1.xlsx]Sheet1'!A1:B2.

let first_Latin_identifier_character = [%sedlex.regexp? ('a'..'z') | ('A'..'Z') ]

let path_identifier_character = [%sedlex.regexp? first_Latin_identifier_character | decimal_digit | '_' | '-' | ':' | '\x5C' (* \ *) | ' ' | '&' | '@']
let file_identifier_character = [%sedlex.regexp? first_Latin_identifier_character | decimal_digit | '_' | '-' | ' ' | '.']
let file_suffix = [%sedlex.regexp? ".xls" | ".xlsm" | ".xlsx" | ".XLS" | ".XLSM" | ".XLSX" | ".xlsb" | ".XLSB"]

let sheet_identifier_character_in_quote = [%sedlex.regexp? Compl ('\x3A' | '\x5C' | '\x2F' | '\x3F' | '\x2A' | '\x5B' | '\x5D' | '\x27')]
let sheet_identifier_character_out_quote = [%sedlex.regexp? Compl ('\x3A' | '\x5C' | '\x2F' | '\x3F' | '\x2A' | '\x27' | '\x28' | '\x29' | '\x2B' | '\x2D' | '\x2F' | '\x2C' |'\x3D' | '\x3E' | '\x3C' | '\x3b')]

let lex_file = [%sedlex.regexp? (Star path_identifier_character), '[', (Plus file_identifier_character), file_suffix, ']']
let lex_file_wo_brackets = [%sedlex.regexp? (Star path_identifier_character), (Plus file_identifier_character), file_suffix]
let lex_sheet_in_quote = [%sedlex.regexp? Plus sheet_identifier_character_in_quote]
let lex_file_sheet_in_quote = [%sedlex.regexp? lex_file, lex_sheet_in_quote]

let lex_before = [%sedlex.regexp? 
                    ("'", lex_file_sheet_in_quote, "'!") |
                    ("'", lex_sheet_in_quote, "'!") |
                    (lex_sheet_out_quote, '!') |
                    (lex_file, "!") |
                    (lex_file_wo_brackets, "!") |
                    ("'", lex_file, "'!") |
                    ("'", lex_file_wo_brackets, "'!")]

Without the last 4 cases of lex_before (i.e., (lex_file, "!") | (lex_file_wo_brackets, "!") | ("'", lex_file, "'!") | ("'", lex_file_wo_brackets, "'!")), the total time of compilation (by ocamlc) of the project was 3 minutes 30 seconds (what took time was the compilation of lexer.ml). After adding those 4 cases, the total time of compilation is 13 minutes 40 seconds. What takes time is always the compilation of lexer.ml.

Does anyone know how we could identify what slows down the compilation?

Is there anything wrong in the way I write named regular expressions that slows down the compilation?

Do you have MVP posted somewhere, so we could check the speed on our machines? — Kakadu, Aug 20 '21 at 12:13
I don't know what MVP is... My machine is quite fast: processor: 3,1 GHz Quad-Core Intel Core i7; memory: 16 GB 2133 MHz LPDDR3. — SoftTimur, Aug 20 '21 at 15:41

score 0 · Answer 1 · answered Aug 20 '21 at 04:53

0

When running ocamlopt on mechanically generated code I've found that things go much faster if I specify the -linscan option, which asks for the linear scan register allocator. I assume it helps when there are very many small functions--more than a person would reasonably write, but not more than a program can generate mechanically.

You don't say whether you're compiling to native code, but if so, this might help. It was definitely helpful in my project.

answered Aug 20 '21 at 04:53

Jeffrey Scofield

65,646
2
72
108

I use `ocamlfind ocamlc`, i will add it to my question. Is there any trick around that? – SoftTimur Aug 20 '21 at 04:55
I don't have as much experience with the bytecode compiler, sorry. You might look at the generated code to see if it gives you any ideas. – Jeffrey Scofield Aug 20 '21 at 05:02

score 0 · Answer 2 · answered Aug 20 '21 at 15:48

Hard to say without more info, but the most likely scenario is that sedlex has trouble compiling the regex into code, which usually means you have some complex hard-to-distinguish cases in your expression.

The lexer has constantly remember if it can be in a lex_file or a lex_file_wo_brackets and where it can be in it for example.

Take the word .x.x.xlx.xls.xls.xls! for example. It is a valid file without bracket (with an empty path). Decoding this while knowing where you are requires to build a graph known as a deterministic finite automaton. Building it can take an exponential time especially if you have those kinds of tricky cases.

If you have the opportunity to simplify your language (adding a prefix to every part so that it at least knows in which branch it is), you could divide your compilation time by six. Writing clearer expressions that will be compiled faster can take some training but it's worth it.

Also, switching to ocamlc.opt (or ocamlopt.opt) will make things go much faster.

Identify what slows down the compilation of a lexer

2 Answers2