Converting files with characters in multiple encodings to utf-8

It’s easy to convert a windows-1252 file into a utf-8 one, right? Use iconv, enca or any other tool of your choice and you’re done. Even if you have thousands of such files, you can easily automate things using Bash, OS X Automator or anything else.
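For that simple case, something along these lines does the job (a minimal sketch, not our actual tooling; the file name and the assumption that the whole file really is windows-1252 are just for illustration):

import codecs

# read the file as windows-1252 and write it back out as utf-8
with codecs.open("legacy.txt", "r", "windows-1252") as f:
    text = f.read()

with codecs.open("legacy.txt", "w", "utf-8") as f:
    f.write(text)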

But what should you do if a single file contains characters encoded both in utf-8 and in something else? Go find the author and kick his ass, obviously. Failing that, we adapted a snippet found on StackOverflow and ended up with this solution:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import sys
from optparse import OptionParser

# shared state for the error handler below
last_position = -1
source_encoding = "utf-8"


def mixed_decoder(unicode_error):
    # Called by the codec for every byte that is not valid utf-8:
    # instead of failing, re-decode that single byte with the legacy encoding.
    global last_position
    global source_encoding
    string = unicode_error.object
    position = unicode_error.start
    # guard against being handed a position we have already fixed
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode(source_encoding)
    # new_char = u"_"  # alternative: just replace the offending byte with a placeholder
    return new_char, position + 1


codecs.register_error("mixed", mixed_decoder)

parser = OptionParser()
parser.add_option("-s", "--source-encoding", dest="source_encoding", default="utf-8")
(options, args) = parser.parse_args()

source_encoding = options.source_encoding

if not args:
    print 'Script should be called with a valid file name as an argument. The file will be converted to utf-8.'
    sys.exit(1)

target_file = args[0]

try:
    f = open(target_file, 'rb')
except IOError:
    print target_file + " is not a valid file"
    sys.exit(1)

s = f.read()
f.close()

# decode as utf-8; any byte that is not valid utf-8 goes through mixed_decoder
s = s.decode("utf-8", "mixed")

# write the cleaned-up text back as pure utf-8
f = open(target_file, 'wb')
f.write(s.encode("utf-8"))
f.close()

None of us is particularly strong in Python and we feel the code could be better, but it works like a charm.
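For completeness, here is how we call it (fix_encoding.py is simply what we happened to name the script); the -s flag tells the handler which legacy encoding to assume for the bytes that are not valid utf-8, and the file is rewritten in place, so keep a copy if you still need the original:

python fix_encoding.py -s windows-1252 broken.txt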