The following code raises an "invalid byte sequence" error that points to the scan call in initialize. Any ideas on how to fix this? For what it's worth, the error does not occur when I remove the (.*) between <h1 and the closing >.
#!/usr/bin/env ruby
class NewsParser
  def initialize
    Dir.glob("./**/index.htm") do |file|
      @file = IO.read file
      parsed = @file.scan(/<h1(.*)>(.*?)<\/h1>(.*)<!-- InstanceEndEditable -->/im)
      self.write(parsed)
    end
  end

  def write output
    @contents = output
    open('output.txt', 'a') do |f|
      f << @contents[0][0]+"\n\n"+@contents[0][1]+"\n\n\n\n"
    end
  end
end

p = NewsParser.new
Edit: Here is the error message:
news_parser.rb:10:in 'scan': invalid byte sequence in UTF-8 (ArgumentError)
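For anyone hitting the same message: it usually means the string is tagged as UTF-8 but contains bytes that are not valid UTF-8, which is typical for pages saved as Latin-1. Below is a minimal sketch of how one might confirm that. The sample string and the names `broken` and `error_message` are made up for illustration; they are not from my script.

```ruby
# Hypothetical sample: "\xE9" is "é" in ISO-8859-1, but on its own it is
# not a valid UTF-8 byte sequence.
broken = "<h1>caf\xE9</h1>"

puts broken.encoding         # the literal is tagged with the source encoding (UTF-8 by default)
puts broken.valid_encoding?  # false: the bytes do not form valid UTF-8

error_message = nil
begin
  # Regexp matching on a string with a broken encoding is expected to
  # raise ArgumentError, as in the error message above.
  broken.scan(/<h1(.*)>(.*?)<\/h1>/im)
rescue ArgumentError => e
  error_message = e.message
end
puts error_message
```

Checking `valid_encoding?` right after reading each file should show which files are the problem ones before scan ever runs.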
SOLVED: The combination of using:
@file = IO.read(file).force_encoding("ISO-8859-1").encode("utf-8", replace: nil)
and the magic comment
# encoding: UTF-8
at the top of the script solved the issue.
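Here is a small self-contained sketch of what that conversion does, assuming the pages really are ISO-8859-1 (I have not checked whether some are Windows-1252 instead). The `raw` string is a hypothetical stand-in for what IO.read returns, not one of my actual files.

```ruby
# Hypothetical stand-in for IO.read(file): raw bytes of a Latin-1 page.
raw = "<h1 class=\"x\">caf\xE9</h1> body <!-- InstanceEndEditable -->".b

# force_encoding only relabels the bytes as ISO-8859-1 (no bytes change);
# encode then transcodes them into real UTF-8.
text = raw.force_encoding("ISO-8859-1").encode("UTF-8")

puts text.valid_encoding?  # true: the string is now valid UTF-8

# The same regexp as in the question now runs without raising.
parsed = text.scan(/<h1(.*)>(.*?)<\/h1>(.*)<!-- InstanceEndEditable -->/im)
puts parsed[0][1]
```

As far as I can tell, every ISO-8859-1 byte maps to some Unicode character, so this conversion should never fail, and the replace: nil option probably makes no difference here. The catch is that if a file is not actually Latin-1, the text will come out garbled rather than raising an error.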
Thanks!