Convert bytes to a string?

I'm using this code to get standard output from an external program:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]

The communicate() method returns the output as a bytes object:

>>> command_stdout
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

However, I'd like to work with the output as a normal Python string, so that I could print it like this:

>>> print(command_stdout)
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1
-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2

I thought that's what the binascii.b2a_qp() method is for, but when I tried it, I got the same byte array again:

>>> binascii.b2a_qp(command_stdout)
b'total 0\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file1\n-rw-rw-r-- 1 thomas thomas 0 Mar  3 07:03 file2\n'

Does anybody know how to convert the bytes value back to string? I mean, using the "batteries" instead of doing it manually. And I'd like it to be ok with Python 3.

Reposted from: https://stackoverflow.com/questions/606191/convert-bytes-to-a-string

16 answers

You need to decode the bytes object to produce a string:

>>> b"abcde"
b'abcde'

# utf-8 is used here because it is a very common encoding, but you
# need to use the encoding your data is actually in.
>>> b"abcde".decode("utf-8") 
'abcde'
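
When the encoding is not certain, decoding can be done defensively. A minimal sketch, assuming a short list of candidate encodings is acceptable for your data; the helper name decode_with_fallback and the candidate list are made up for this example and are not part of any library:

```python
def decode_with_fallback(data, encodings=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first candidate and mark every
    # undecodable byte with U+FFFD instead of raising.
    return data.decode(encodings[0], errors="replace"), encodings[0]


print(decode_with_fallback(b"abcde"))
print(ascii(decode_with_fallback(b"caf\xe9")))
```

The order of the candidates matters: UTF-8 goes first because it rejects most input that is not UTF-8, whereas single-byte encodings accept almost anything and would mask a wrong guess.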
csdnceshi53
Lotus@ Does anyone know how to do the same operation in the TensorFlow graph?
almost 2 years ago
csdnceshi69
YaoRaoLov UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 168: invalid start byte
about 2 years ago
weixin_41568131
10.24 While this is generally the way to go, you need to be certain you've got the encoding right, or your code might end up vomiting all over itself. To make it worse, data from the outside world can contain unexpected encodings. The chardet library at pypi.org/project/chardet can help you with this, but again, always program defensively: sometimes even chardet can get it wrong, so wrap your junk with some appropriate exception handling.
about 2 years ago
csdnceshi74
7*4 See @borislav-sabev's answer below. Much better solution.
more than 2 years ago
weixin_41568134
MAO-EYE I have some code for a networking program: [def dataReceived(self, data): print(f"Received quote: {data}")]. It prints "received quote: b'\x00&C:\\Users\\.pycharm2016.3\\config\x00&C:\\users\\pycharm\\system\x00\x03--'". How would I change my code to fix this? Writing print(f"receivedquote: {data}".decode('utf-8')) does not do the trick.
more than 2 years ago
csdnceshi58
Didn"t forge Small update on using sys.stdout.encoding - this is allowed to be None, which will cause encode() to fail.
almost 3 years ago
csdnceshi56
lrony* docs.python.org/3.5/library/stdtypes.html#bytes.decode
more than 4 years ago
csdnceshi78
程序go It's kinda hidden. See the answer below for a reference to the documentation. It's also in the bytes docstring (help(command_stdout)).
more than 4 years ago
csdnceshi70
笑故挽风 : This won’t work on an array like it worked in Python 2.
almost 5 years ago
csdnceshi64
游.程 Here: docs.python.org/devguide/documenting.html
almost 5 years ago
csdnceshi60
℡Wang Yan If the content is random binary values, the utf-8 conversion is likely to fail. Instead see @techtonik's answer (below) stackoverflow.com/a/27527728/198536
about 5 years ago
csdnceshi75
衫裤跑路 Python 2.7.6 doesn't handle b"\x80\x02\x03".decode("utf-8") -> UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte.
about 6 years ago
csdnceshi59
ℙℕℤℝ What other decoding options does the binary object possess?
more than 6 years ago
csdnceshi80
胖鸭 I've filed a bug about documenting it at bugs.python.org/issue17860 - feel free to propose a patch. If it is hard to contribute, comments on how to improve that are welcome.
more than 7 years ago
csdnceshi71
Memor.の Maybe this will help somebody further: sometimes you use a byte array for e.g. TCP communication. If you want to convert a byte array to a string and cut off trailing '\x00' characters, the answer above is not enough. Use b'example\x00\x00'.decode('utf-8').strip('\x00') then.
more than 7 years ago
csdnceshi77
狐狸.fox This is the second time I forgot about this and it’s still nowhere to be found in the documentation, not even in the unicode section. What a shame.
more than 7 years ago
csdnceshi54
hurriedly% Using "windows-1252" is not reliable either (e.g., for other language versions of Windows); wouldn't it be best to use sys.stdout.encoding?
more than 8 years ago
weixin_41568174
from.. Yes, but given that this is the output from a Windows command, shouldn't it instead be using ".decode('windows-1252')"?
about 9 years ago
csdnceshi51
旧行李 This 'solution' was particularly hard to find (for me at least) considering it is such a simple problem ... I'd love to put a line somewhere in the subprocess docs about this, since I bet a good portion of newbies like me will hit this snag when using subprocess. Anybody know about contributing to the Python docs?
almost 10 years ago

You need to decode the byte string and turn it into a character (Unicode) string.

b'hello'.decode(encoding)

or

str(b'hello', encoding)
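
A small sketch comparing the two spellings, with 'utf-8' chosen as the encoding for illustration. Note that calling str() on bytes without an encoding does not decode anything; it returns the repr of the bytes object instead:

```python
data = b"hello"

via_method = data.decode("utf-8")      # bytes method
via_constructor = str(data, "utf-8")   # str constructor with an encoding
not_decoded = str(data)                # no encoding: the repr, b'hello' with quotes

print(via_method, via_constructor, not_decoded)
```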
csdnceshi71
Memor.の str(s, 'utf-8') worked for me in Python 3
more than 4 years ago
csdnceshi76
斗士狗 : This doesn’t work with Python 3.
almost 5 years ago
csdnceshi60
℡Wang Yan Note that the str function in Python 2 (at least 2.7.5, which I'm running) doesn't support the second encoding parameter, so it's better to go with the decode method if you want your code to work on Python 2 and 3.
more than 6 years ago

I think this way is easy:

>>> bytes = [112, 52, 52]
>>> "".join(map(chr, bytes))
'p44'
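
For reference, a sketch of what this join amounts to in Python 3: mapping chr over integers in range(256) gives the same text as building a bytes object and decoding it as latin-1 (or as ascii when all values are below 128). The variable name byte_values is chosen for this example to avoid shadowing the built-in bytes:

```python
byte_values = [112, 52, 52]

joined = "".join(map(chr, byte_values))
decoded = bytes(byte_values).decode("latin-1")
print(joined, decoded)

# The equivalence holds for every possible byte value, not just ASCII.
all_values = list(range(256))
same_for_all = "".join(map(chr, all_values)) == bytes(all_values).decode("latin-1")
print(same_for_all)
```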
csdnceshi50
三生石@ For completeness' sake: bytes(list_of_integers).decode('ascii') is about 1/3rd faster than ''.join(map(chr, list_of_integers)) on Python 3.6.
about 2 years ago
csdnceshi53
Lotus@ For Python 3 this should be equivalent to bytes([112, 52, 52]) - btw bytes is a bad name for a local variable exactly because it's a Python 3 builtin
almost 3 years ago
csdnceshi68
local-host this method is a perverted way to express: a.decode('latin-1') where a = bytearray([112, 52, 52]) ("There Ain't No Such Thing as Plain Text". If you've managed to convert bytes into a text string then you used some encoding: latin-1 in this case)
more than 3 years ago
weixin_41568208
北城已荒凉 It can convert bytes read from a file with "rb" to a string, and it's handy when you don't know the encoding
almost 4 years ago
csdnceshi50
三生石@ the title can indeed be misleading, I'll edit.
almost 6 years ago
csdnceshi61
derek5. Pieters Yes. So with that point, this isn't the best answer for the body of the question that was asked. And the title is misleading, isn't it? He/she wants to convert a byte string to a regular string, not a byte array to a string. This answer works okay for the title of the question that was asked.
almost 6 years ago
csdnceshi50
三生石@ you also appear to be talking about integers and bytearrays, not a bytes value (as returned by Popen.communicate()).
almost 6 years ago
csdnceshi61
derek5. Pieters Fair enough. In Python 3.4.1 x86 this method takes 17.01ms, the others 24.02ms, and 11.51ms for the bytearray to string cast. So it's not the fastest in that case.
almost 6 years ago
csdnceshi50
三生石@ yet the OP here is using Python 3.
almost 6 years ago
csdnceshi61
derek5. Pieters I just did a simple benchmark with these other answers, running multiple 10,000 runs stackoverflow.com/a/3646405/353094 And the above solution was actually much faster every single time. For 10,000 runs in Python 2.7.7 it takes 8ms, versus the others at 12ms and 18ms. Granted there could be some variation depending on input, Python version, etc. Doesn't seem too slow to me.
almost 6 years ago
csdnceshi50
三生石@ yet it is terribly inefficient. If you have a byte array you only need to decode.
almost 6 years ago
csdnceshi61
derek5. Thank you, your method worked for me when none other did. I had a non-encoded byte array that I needed turned into a string. Was trying to find a way to re-encode it so I could decode it into a string. This method works perfectly!
about 6 years ago

I think what you actually want is this:

>>> from subprocess import *
>>> command_stdout = Popen(['ls', '-l'], stdout=PIPE).communicate()[0]
>>> command_text = command_stdout.decode(encoding='windows-1252')

Aaron's answer was correct, except that you need to know WHICH encoding to use. And I believe that Windows uses 'windows-1252' (at least on English and Western European installations). It will only matter if you have some unusual (non-ASCII) characters in your content, but then it will make a difference.

By the way, the fact that it DOES matter is the reason that Python moved to using two different types for binary and text data: it can't convert magically between them because it doesn't know the encoding unless you tell it! The only way YOU would know is to read the Windows documentation.
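
If the encoding is not known in advance, one option is to ask the system for its preferred encoding rather than hard-coding one. A minimal sketch, assuming the child process writes its output in the locale's encoding (which is not guaranteed); sys.executable is used here only so that the example runs on any platform:

```python
import locale
import subprocess
import sys

# Run a child process and capture its raw output as bytes.
raw = subprocess.Popen(
    [sys.executable, "-c", "print('hello')"],
    stdout=subprocess.PIPE,
).communicate()[0]

# Ask the system which encoding it prefers, then decode explicitly.
encoding = locale.getpreferredencoding(False)
text = raw.decode(encoding)
print(type(raw).__name__, "->", type(text).__name__, "using", encoding)
```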

weixin_41568134
MAO-EYE 'latin-1' is a verbatim encoding with all code points set, so you can use that to effectively read a byte string into whichever type of string your Python supports (so verbatim on Python 2, into Unicode for Python 3).
more than 3 years ago
csdnceshi64
游.程 The open() function for text streams, or Popen() if you pass it universal_newlines=True, do magically decide the character encoding for you (locale.getpreferredencoding(False) in Python 3.3+).
more than 6 years ago

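If you don't know the encoding, then to read binary input into a string in a way that is compatible with both Python 3 and Python 2, use the ancient MS-DOS cp437 encoding:

import sys

PY3K = sys.version_info >= (3, 0)

# stream is any iterable of byte strings, e.g. a file opened in 'rb' mode
lines = []
for line in stream: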
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('cp437'))

Because the encoding is unknown, expect non-English symbols to translate to characters of cp437 (English characters are not translated, because they match in most single-byte encodings and UTF-8).

Decoding arbitrary binary input as UTF-8 is unsafe, because you may get this:

>>> b'\x00\x01\xffsd'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 2: invalid start byte

The same applies to ascii, which was the default codec in Python 2: any byte above 0x7F makes Python choke with the infamous "ordinal not in range(128)" error. Python's latin-1 codec, by contrast, maps all 256 byte values, so decoding with it never fails.

UPDATE 20150604: Python 3 has the surrogateescape error handler (PEP 383), which lets undecodable bytes pass through str and back into binary data without data loss or crashes, but it needs [binary] -> [str] -> [binary] conversion tests to validate both performance and reliability.
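
A minimal round-trip check of the surrogateescape handler in Python 3, as a sketch rather than a full validation. It reuses the byte string from the failing UTF-8 example above and says nothing about performance:

```python
raw = b"\x00\x01\xffsd"

# The invalid byte 0xff becomes the lone surrogate U+DCFF instead of raising.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns the surrogate back into 0xff.
restored = text.encode("utf-8", "surrogateescape")

print(ascii(text))
print(restored == raw)
```

The resulting string contains a lone surrogate, so it cannot be encoded with the default strict handler; that includes printing it to a UTF-8 terminal, which is why the example prints ascii(text).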

UPDATE 20170116: Thanks to a comment by Nearoo, there is also the possibility to slash-escape all unknown bytes with the backslashreplace error handler. That works only for Python 3 (for decoding, 3.5 and later), so even with this workaround you will still get inconsistent output from different Python versions:

import sys

PY3K = sys.version_info >= (3, 0)

lines = []
for line in stream:
    if not PY3K:
        lines.append(line)
    else:
        lines.append(line.decode('utf-8', 'backslashreplace'))

See https://docs.python.org/3/howto/unicode.html#python-s-unicode-support for details.
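
For comparison, a short sketch of how three of the built-in error handlers treat the same invalid input when decoding in Python 3.5 or later. The sample bytes are chosen for illustration:

```python
raw = b"\x80abc"

results = {}
for handler in ("ignore", "replace", "backslashreplace"):
    results[handler] = raw.decode("utf-8", handler)

for handler, text in results.items():
    # ascii() keeps the output printable on any terminal encoding
    print(handler, ascii(text))
```

'ignore' drops the bad byte silently, 'replace' substitutes U+FFFD, and 'backslashreplace' keeps the byte value visible as an escape sequence. Only the last one preserves enough information to see what the original byte was.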

UPDATE 20170119: I decided to implement a slash-escaping decode that works for both Python 2 and Python 3. It should be slower than the cp437 solution, but it should produce identical results on every Python version.

# --- preparation

import codecs

def slashescape(err):
    """codecs error handler. err is a UnicodeDecodeError instance. Return
    a tuple with a replacement for the undecodable part of the input
    and the position where decoding should continue."""
    # bytearray yields integers on both Python 2 and 3, and copes with
    # error ranges that span more than one byte
    thebytes = bytearray(err.object[err.start:err.end])
    repl = u''.join(u'\\x%02x' % b for b in thebytes)
    return (repl, err.end)

codecs.register_error('slashescape', slashescape)

# --- processing

stream = [b'\x80abc']

lines = []
for line in stream:
    lines.append(line.decode('utf-8', 'slashescape'))
csdnceshi57
perhaps? Updated the answer. Unfortunately it doesn't work with Python 2 - see stackoverflow.com/questions/25442954/…
more than 3 years ago
weixin_41568208
北城已荒凉 There is the possibility to leave the escape sequence in the string and move on: b'\x80abc'.decode("utf-8", "backslashreplace") will result in '\\x80abc'. This information was taken from the unicode documentation page, which seems to have been updated since the writing of this answer.
more than 3 years ago
csdnceshi80
胖鸭 You can also just ignore unicode errors with b'\x00\x01\xffsd'.decode('utf-8', 'ignore') in Python 3.
about 4 years ago
csdnceshi57
perhaps? Do you mean list? And why should it work on arrays? Especially arrays of floats..
almost 5 years ago
csdnceshi51
旧行李 : This won’t work on an array like it worked in Python 2.
almost 5 years ago
csdnceshi74
7*4 Brilliant! This is much faster than @Sisso's method for a 256 MB file!
about 5 years ago
csdnceshi57
perhaps? I really feel like Python should provide a mechanism to replace missing symbols and continue.
more than 5 years ago

Since this question is actually asking about subprocess output, you have a more direct approach available since Popen accepts an encoding keyword (in Python 3.6+):

>>> from subprocess import Popen, PIPE
>>> text = Popen(['ls', '-l'], stdout=PIPE, encoding='utf-8').communicate()[0]
>>> type(text)
<class 'str'>
>>> print(text)
total 0
-rw-r--r-- 1 wim badger 0 May 31 12:45 some_file.txt

The general answer for other users is to decode bytes to text:

>>> b'abcde'.decode()
'abcde'

With no argument, the default encoding 'utf-8' is used (in Python 3, sys.getdefaultencoding() always returns 'utf-8'). If your data is not UTF-8, then you must specify the encoding explicitly in the decode call:

>>> b'caf\xe9'.decode('cp1250')
'café'

While @Aaron Maenpaa's answer just works, a user recently asked

Is there any more simply way? 'fhand.read().decode("ASCII")' [...] It's so long!

You can use

command_stdout.decode()

decode() has default arguments, the same ones as codecs.decode:

codecs.decode(obj, encoding='utf-8', errors='strict')

Set universal_newlines to True, i.e.

command_stdout = Popen(['ls', '-l'], stdout=PIPE, universal_newlines=True).communicate()[0]
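
A sketch of the same flag with subprocess.check_output. With universal_newlines=True the output arrives as str, decoded with locale.getpreferredencoding(False); Python 3.7 and later also accept text=True as an alias for this flag. sys.executable stands in for the external command so that the example runs anywhere:

```python
import subprocess
import sys

out = subprocess.check_output(
    [sys.executable, "-c", "print('hello')"],
    universal_newlines=True,  # Python 3.7+: text=True is an equivalent spelling
)

print(type(out).__name__, repr(out))
```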
csdnceshi69
YaoRaoLov I've been using this method and it works. Although, it's just guessing at the encoding based on user preferences on your system, so it's not as robust as some other options. This is what it's doing, referencing docs.python.org/3.4/library/subprocess.html: "If universal_newlines is True, [stdin, stdout and stderr] will be opened as text streams in universal newlines mode using the encoding returned by locale.getpreferredencoding(False)."
more than 6 years ago

If you get the following when trying decode():

AttributeError: 'str' object has no attribute 'decode'

then the object is already a str and needs no decoding. For a bytes object, you can also specify the encoding straight in a cast:

>>> my_byte_str
b'Hello World'

>>> str(my_byte_str, 'utf-8')
'Hello World'
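
When a value may arrive either as bytes or as str, checking the type first avoids both this AttributeError and the TypeError that str(text, 'utf-8') raises for an existing str. A minimal sketch; the helper name to_text is hypothetical and the utf-8 default is an assumption:

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Return value as str, decoding it only if it is bytes-like."""
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding, errors)
    if isinstance(value, str):
        return value  # already text, nothing to decode
    raise TypeError("expected bytes or str, got %s" % type(value).__name__)


print(to_text(b"Hello World"))
print(to_text("Hello World"))
```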

I made a function to clean a list of strings:

def cleanLists(self, lista):
    lista = [x.strip() for x in lista]
    lista = [x.replace('\n', '') for x in lista]
    lista = [x.replace('\b', '') for x in lista]
    lista = [x.encode('utf8') for x in lista]
    lista = [x.decode('utf8') for x in lista]

    return lista
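
The five passes can be collapsed into a single list comprehension. A sketch under the assumption that the list holds str values, in which case the encode/decode round trip returns the same string and is left out; the name clean_list is chosen for this example:

```python
def clean_list(items):
    """Strip surrounding whitespace, then drop newlines and backspaces, in one pass."""
    return [
        item.strip().replace("\n", "").replace("\b", "")
        for item in items
    ]


print(clean_list(["  a\nb  ", "x\by\n"]))
```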
weixin_41568127
?yb? Maybe it saves on allocation but the number of operations would remain the same.
about 3 years ago
csdnceshi72
谁还没个明天 You can actually chain all of the .strip, .replace, .encode, etc. calls in one list comprehension and only iterate over the list once instead of iterating over it five times.
about 3 years ago