
I want to get a list of all words from an aspell dictionary. I downloaded aspell and the aspell Polish dictionary, then unzipped it using:

preunzip pl.cwl

I got pl.wl:

...
hippie
hippies
hippiesowski/bXxYc
hippika/MNn
hippis/NOqsT
hippisiara/MnN
hippiska/mMN
hippisowski/bXxYc
...

but some words appear with a suffix like /bXxYc or /MNn. These suffixes are defined in pl_affix.dat, which looks like this:

...
SFX n Y 5
SFX n   a         0         [^ij]a
SFX n   ja        yj        [^aeijoóuy]ja
SFX n   a         0         [aeijoóuy]ja
SFX n   ia        ij        [^drt]ia
SFX n   ia        yj        [drt]ia
...

They describe declension and conjugation. How can I add all word forms to the first list (with all the corresponding suffixes applied as defined in the .dat file)?

BTW: I need this list for the Jazzy spell checker.

Riot
  • 103
rafalmag
  • 487

2 Answers


Give this a try:

aspell -d pl dump master | aspell -l pl expand > my.dict

Edited to match corrections in comment.
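Note that expand writes each base word and all of its expanded forms on a single line, separated by spaces. If you need one word per line (an assumption about what your word-list consumer expects), a minimal sketch is below. The helper name one_word_per_line and the sample input are placeholders, not real dictionary data.

```shell
# Hypothetical helper: split space-separated forms onto separate lines
# and drop duplicate words. Reads stdin, writes stdout.
one_word_per_line() {
    tr -s ' ' '\n' | sort -u
}

# Placeholder input that only imitates the shape of `aspell expand` output.
printf 'word1\nword2 word2a word2b\n' | one_word_per_line
```

With aspell installed, it could be appended to the command above: `aspell -d pl dump master | aspell -l pl expand | one_word_per_line > my.dict`.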


For some languages, e.g. Italian, expanding is not enough and you will have to do some more processing to get a list of plain words.

This is the command I use to get a list of words in Italian (note that it will take some time to perform):

aspell -d it dump master | aspell -l it expand | sed "s/\w*'//g;s/ \+/\n/g" |
awk '{ print tolower($0) }' | uniq > wordlist.txt

Breaking down the pipeline

Aspell expansion:

  • aspell -d it dump master | aspell -l it expand > list1
a
ab
abaco Quell'Abaco quell'Abaco quell'abaco Quest'Abaco quest'Abaco quest'abaco D'Abaco d'Abaco d'abaco Coll'Abaco coll'Abaco coll'abaco Sull'Abaco sull'Abaco sull'abaco Nell'Abaco nell'Abaco nell'abaco Dall'Abaco dall'Abaco dall'abaco Dell'Abaco dell'Abaco dell'abaco All'Abaco all'Abaco all'abaco L'Abaco l'Abaco l'abaco Bell'Abaco bell'Abaco bell'abaco Brav'Abaco brav'Abaco brav'abaco abachi
Abacuc
...

Remove every run of word characters that ends in an apostrophe, together with the apostrophe:

  • sed "s/\w*'//g" list1 > list2
a
ab
abaco Abaco Abaco abaco Abaco Abaco abaco Abaco Abaco abaco Abaco Abaco abaco Abaco Abaco abaco Abaco Abaco abaco Abaco Abaco abaco Abaco Abaco abaco Abaco Abaco abaco Abaco Abaco abaco Abaco Abaco abaco Abaco Abaco abaco abachi
Abacuc
...

Break lines on space(s):

  • sed "s/ \+/\n/g" list2 > list3
a
ab
abaco
Abaco
...

Lowercase everything, so that case variants on neighbouring lines become identical and uniq (which only removes adjacent duplicates) can drop them without sorting:

  • awk '{ print tolower($0) }' list3 > list4
a
ab
abaco
abaco
...

Remove duplicates:

  • uniq list4 > list5
a
ab
abaco
abachi
...
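
The steps above rely on GNU sed extensions (\w, \+, and \n in the replacement), and uniq misses duplicates that are not on adjacent lines. A possible variant, as a sketch only: it splits first, strips the prefix per word, and uses sort -u, so the output order changes to sorted order. The function name plain_words is made up for illustration, and the sample input is taken from the list1 excerpt above.

```shell
# Hypothetical helper: reads `aspell expand` output on stdin and writes a
# lowercase, de-duplicated word list, one word per line.
plain_words() {
    tr -s ' ' '\n' |                    # one token per line
    sed -e "s/^.*'//" -e '/^$/d' |      # strip prefix up to the apostrophe, drop empty lines
    awk '{ print tolower($0) }' |       # lowercase, as in the original pipeline
    sort -u                             # remove all duplicates, not only adjacent ones
}

# Small sample in the shape of the expanded Italian output
printf "abaco Quell'Abaco d'abaco abachi\nAbacuc\n" | plain_words
```

With aspell installed, the full run would be `aspell -d it dump master | aspell -l it expand | plain_words > wordlist.txt`.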
etuardu
  • 867