I am trying to write a method to remove some blacklisted characters like bom characters using their UTF-8 values. I am successful to achieve this by creating a method in String class with the following logic,
def remove_blacklist_utf_chars
self.force_encoding("UTF-8").gsub!(config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "")
self
end
Now to make it useful across the applications and reusable I create a config in a yml file. The yml structure is something like,
:blacklist_utf_chars:
:zero_width_space: '"\u{200b}"'
(Edit) Also as suggested by Drenmi this didn't work,
:blacklist_utf_chars:
:zero_width_space: \u{200b}
The problem I am facing is that the method remove_blacklist_utf_chars does not work when I load the utf-encoding of blacklist characters from yml file
But when I directly pass these in the method and not via the yml file the method works.
So basically
self.force_encoding("UTF-8").gsub!("\u{200b}".force_encoding("UTF-8"), "") -- works.
but,
self.force_encoding("UTF-8").gsub!(config[:blacklist_utf_chars][:zero_width_space].force_encoding("UTF-8"), "") -- doesn't work.
I printed the value of config[:blacklist_utf_chars][:zero_width_space] and its equal to "\u{200b}"
I got this idea by referring: https://stackoverflow.com/a/5011768/2362505.
Now I am not sure how what exactly is happening when the blacklist chars list is loaded via yml in ruby code.
EDIT 2:
On further investigation I observed that there is an extra \ getting added while reading the hash from the yaml.
So,
puts config[:blacklist_utf_chars][:zero_width_space].dump
prints:
"\\u{200b}"
But then if I just define the yaml as:
:blacklist_utf_chars:
:zero_width_space: 200b
and do,
ch = "\u{#{config[:blacklist_utf_chars][:zero_width_space]}}"
self.force_encoding("UTF-8").gsub!(ch.force_encoding("UTF-8"), "")
I get
/Users/harshsingh/dir/to/code/utils.rb:121: invalid Unicode escape (SyntaxError)