Converting files with characters in multiple encodings to utf-8
It’s easy to convert a windows-1252 file into a utf-8 one, right? Use iconv, enca or any other tool of your choice and you’re done. Even if you have thousands of such files, you can easily automate things using Bash, OS X Automator or anything else.
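In fact, the single-encoding case boils down to one decode and one encode. Here is a minimal Python 2 sketch of what iconv does for you (the file names are placeholders):

# Single-encoding conversion, roughly `iconv -f WINDOWS-1252 -t UTF-8`.
# 'input.txt' and 'output.txt' are placeholder names.
with open('input.txt', 'rb') as f:
    text = f.read().decode('windows-1252')
with open('output.txt', 'wb') as f:
    f.write(text.encode('utf-8'))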
But what should you do if a single file contains characters encoded both in utf-8 and in something else? Go find the author and kick his ass, obviously.
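To see why the usual tools give up here, take a hypothetical string containing “café” twice, first utf-8-encoded and then windows-1252-encoded; neither decode call handles the whole thing:

# 'caf\xc3\xa9' is "café" in utf-8, 'caf\xe9' is "café" in windows-1252
s = 'caf\xc3\xa9 caf\xe9'
s.decode('utf-8')         # raises UnicodeDecodeError on the stray \xe9 byte
s.decode('windows-1252')  # no error, but the utf-8 part becomes "cafÃ©"

We adapted a snippet found on StackOverflow and ended up with the following script: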
#!/usr/bin/python
# -*- coding: utf-8 -*-

import codecs
import sys
from optparse import OptionParser

last_position = -1
source_encoding = "windows-1252"


def mixed_decoder(unicode_error):
    """Error handler: whenever a byte sequence is not valid utf-8,
    re-decode the offending byte with the fallback source encoding
    and tell the decoder to resume right after it."""
    global last_position
    global source_encoding
    string = unicode_error.object
    position = unicode_error.start
    # Guard against the decoder re-reporting a position we already handled.
    if position <= last_position:
        position = last_position + 1
    last_position = position
    new_char = string[position].decode(source_encoding)
    # Alternatively, drop in a placeholder instead: new_char = u"_"
    return new_char, position + 1

codecs.register_error("mixed", mixed_decoder)

parser = OptionParser()
parser.add_option("-s", "--source-encoding", dest="source_encoding",
                  default="windows-1252",
                  help="fallback encoding for bytes that are not valid utf-8")
(options, args) = parser.parse_args()
source_encoding = options.source_encoding

if not args:
    print 'Script should be called with a valid file name as an argument. The file will be converted to utf-8.'
    sys.exit(1)
target_file = args[0]

try:
    f = open(target_file, 'rb')
except IOError:
    print target_file + " is not a valid file"
    sys.exit(1)

s = f.read()
f.close()

s = s.decode("utf-8", "mixed")

f = open(target_file, 'wb')
f.write(s.encode("utf-8"))
f.close()
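The trick is codecs.register_error: whenever s.decode("utf-8", "mixed") hits a byte sequence that is not valid utf-8, Python calls mixed_decoder, which decodes that single byte with the fallback encoding and returns the replacement character together with the position where decoding should resume. Assuming the script is saved as, say, convert_mixed.py, a typical invocation looks like:

./convert_mixed.py -s windows-1252 some_file.txt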
None of us is particularly good at Python, and we feel the code could be better, but it works like a charm.