How do I parse a Tiddlywiki into a bunch of plain text files?

Question

I found a Tiddlywiki plugin to export all the tiddlers into one plain text file, but I want to take my Tiddlywiki file and export the individual tiddlers to individual text files (later to be migrated into notes.vim). Is there an easy way to do this with bash or vim?

The Tiddlywiki file contains formatting/syntax like this:

<div title="Anthony Wallace" creator="Jon" modifier="Jon" created="201104020927" changecount="1" tags="anthropologists, mythology">

and I want to parse the div contents and make a file called "Anthony Wallace", with the first two lines being:

Anthony Wallace

@anthropologists @mythology

peth · Accepted Answer · 2011-05-29T17:56:46.433

This script should do it, under a few assumptions anyway. For example, it will break if attributes in the div tag contain a closing angle bracket (>), if the order of the title and creator attributes changes, or if the div tag spans multiple lines.

#!/usr/bin/awk -f

# treat the opening tag line here
/<div title=".*" creator=".*"/ {
    indiv = 1                                            # inside div from here on
    name = gensub(/.* title="([^"]+)".*/, "\\1", 1)      # extract name
    tagsattr = ""                                        # stays empty without a tags attribute
    if ($0 ~ / tags="/)
        tagsattr = gensub(/.* tags="([^"]*)".*/, "\\1", 1)   # extract tags string
    n = split(tagsattr, tags, /, /)                      # split tags into array, n = count

    print(name) > name                                   # print name into file "name"
    for (i = 1; i <= n; i++) printf("@%s ", tags[i]) >> name   # tags in order, "@" prefix
    printf("\n\n") >> name                               # two newlines
    sub(/.*<div [^>]+>/, "")                             # remove the tag so the rest
                                                         # of the line can be printed
}

# treat closing line
indiv == 1 && /<\/div>/ {
    sub(/<\/div>.*/, "")                                 # remove tag so the rest
    print >> name                                        # can be printed
    indiv = 0                                            # outside div from here on
}

# print all other lines inside of div
indiv == 1 {
    print >> name
}

Make the script executable with chmod +x and call it with the input file name as its argument. As written, it creates its output files (one per tiddler) in the current directory, so run it from a scratch directory.

If your input files are structured in a directory tree, you may have to find the right command line with shell wildcards, loops, or the find utility.
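
For the directory-tree case, a minimal sketch with find and a loop might look like this. The function name export_all, the AWK variable, and the one-subdirectory-per-wiki layout are assumptions for illustration, not part of the script above.

```shell
# Sketch only. export_all is a hypothetical helper; all three arguments
# must be absolute paths because the loop changes directory.
#   $1  the awk script from this answer, saved to a file
#   $2  directory tree containing the TiddlyWiki .html files
#   $3  directory that receives one subdirectory per wiki
# AWK is an assumed override for the interpreter (default: gawk).
export_all() {
    script=$1
    src=$2
    out=$3
    # Note: file names containing newlines would break this simple loop.
    find "$src" -type f -name '*.html' | while IFS= read -r wiki; do
        # One subdirectory per wiki file name. Wikis with the same file
        # name in different directories would share a subdirectory.
        dest=$out/$(basename "$wiki" .html)
        mkdir -p "$dest"
        # The script writes into the current directory, so run it there.
        (cd "$dest" && "${AWK:-gawk}" -f "$script" "$wiki")
    done
}
```

It could then be called as: export_all "$PWD/tiddlers.awk" "$PWD/wikis" "$PWD/notes" (all three names are placeholders).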

score 1 · Answer 2 · answered Nov 22 '11 at 16:31

Note that gensub is a gawk extension to awk, so the first line should really be

#!/usr/bin/gawk -f

With some versions of TiddlyWiki the lines look like this (line 4):

/<div title=".*" modifier=".*"/
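
Since the attribute order differs between versions, one option is to match only on the title attribute and look each attribute up by name. The following is a rough sketch of that idea in portable awk (match and substr instead of gensub); the helper names extract_attrs and attr and the "|" separator are made up for the example.

```shell
# Sketch only. attr(line, name) is a hypothetical helper that returns the
# value of name="..." from an opening div tag, or "" if it is absent.
extract_attrs() {
    awk '
    function attr(line, name,    pat) {
        pat = " " name "=\"[^\"]*\""
        if (!match(line, pat)) return ""
        # Skip the leading space, the name and the =" (length(name) + 3
        # characters), and leave out the closing quote.
        return substr(line, RSTART + length(name) + 3, RLENGTH - length(name) - 4)
    }
    # Match on the title alone, whatever attribute follows it.
    /<div title="[^"]*"/ {
        print attr($0, "title") "|" attr($0, "tags")
    }' "$@"
}
```

Called as extract_attrs mywiki.html (file name is a placeholder), it prints one title|tags line per opening tag, which makes it easy to check the pattern against a particular wiki before writing any files.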

I wanted to extract all the tiddlers into one html file so I removed all the redirections to the 'name' file and added this top and tail code:

BEGIN { print("<html>") }
END { print("</html>") }

Really helpful code, shows the power of awk! Many thanks, Peter
