I believe the recommended way to get the contents of a large file stored on GitHub is to use the REST API. For files between 1 MB and 100 MB in size, it's only possible to get the raw contents (in string format).
I need to write this content to a file. If I use the pygithub package, I get exactly what I need (in bytes format, and the response object contains an encoding field whose value is base64). Unfortunately, this package does not work for files larger than 1 MB.
So it seems that I only need to find the correct way to convert the string to bytes. There are many ways to do it; I have tried 4 so far, and none of them matches the output of the pygithub package. See the output for several guinea pig files below. How do I do the conversion correctly?
from github import Github, ContentFile
import requests
from requests.structures import CaseInsensitiveDict
import base64
token = ...
repo_name = ...
owner = ...
filename = ...
# pygithub method
github_object = Github(token)
github_user = github_object.get_user()
repo = github_user.get_repo(repo_name)
cont_obj = repo.get_contents(filename)
print('encoding', cont_obj.encoding) # prints base64
content_ref = cont_obj.decoded_content # this works correctly for <1MB files
# REST API method
url = f"https://api.github.com/repos/{owner}/{repo_name}/contents/{filename}"
headers = CaseInsensitiveDict()
headers["Accept"] = "application/vnd.github.v3.raw"
headers["Authorization"] = f"Bearer {token}"
headers["X-GitHub-Api-Version"] = "2022-11-28"
contents_str = requests.get(url, headers=headers).text
contents = []
# https://stackoverflow.com/questions/72037211/how-to-convert-a-base64-file-to-bytes
contents.append(base64.b64decode(contents_str.encode() + b'=='))
# https://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3
contents.append(bytes(contents_str, encoding="raw_unicode_escape"))
# https://stackoverflow.com/questions/7585435/best-way-to-convert-string-to-bytes-in-python-3
message_bytes = contents_str.encode('utf-8')
contents.append(base64.b64encode(message_bytes))
#contents.append(base64.decodebytes(message_bytes + b'==')) same as method 0
print(type(content_ref), len(content_ref), content_ref[:50])
for i, c in enumerate(contents):
    print(i, type(c), len(c), c[:50])
The output of the guinea pig files:
The text file that contains "tiny text" shows that all methods except method 1 are incorrect:
<class 'bytes'> 10 b'tiny text\n'
0 <class 'bytes'> 6 b'\xb6)\xf2\xb5\xecm'
1 <class 'bytes'> 10 b'tiny text\n'
2 <class 'bytes'> 16 b'dGlueSB0ZXh0Cg=='
For this pdf file, the length of the output of method 1 is slightly bigger, and the contents are slightly different:
<class 'bytes'> 3028 b'%PDF-1.3\r\n%\xe2\xe3\xcf\xd3\r\n\r\n1 0 obj\r\n<<\r\n/Type /Catalog\r\n/O'
0 <class 'bytes'> 1504 b'<1u\xdf](n?\xd3\xca\x97\xbf\t\xabZ\x96\x88?:\xebe\x8aw\xac\xdbD\x7f=\xa8\x1e\xb3}\x11zwhn=\xb4\xa1\xb8\xffO*^\xfc\xeb\xad\x96)'
1 <class 'bytes'> 3048 b'%PDF-1.3\r\n%\ufffd\ufffd\ufffd\ufffd\r\n\r\n1 0 obj\r\n<<'
2 <class 'bytes'> 4048 b'JVBERi0xLjMNCiXvv73vv73vv73vv70NCg0KMSAwIG9iag0KPD'
For this image, the size and the contents of the output of method 1 are very different:
<class 'bytes'> 57270 b'GIF89a\xfa\x00)\x01\xe7\xff\x00\x06\t\r\x0f\n\x08\x19\r\x0c \x0e\n,\x12\x0b"\x18\x17\x1f\x1a\x16"\x1b\x12&\x18\x18&!\x17%!\x1b* \x1d,'
0 <class 'bytes'> 329 b'\x18\x81|\xf5\xa17\xebo\xb6\xf3^\x1b\x03]\x02\xdb\xcd\x04\xfc\x8e\xc7\xd8\xb1D\x0c\xa36\xe8\xd3\x00\xbf\x9e\x94\xf5$\xbcT\x04D\xf9\x11\xfa_U\x14\x05\xd5\xfce'
1 <class 'bytes'> 169079 b'GIF89a\ufffd\x00)\x01\ufffd\ufffd\x00\x06\t\r\x0f\n\x08\x19\r\x0c \x0e\n,\x12\x0b"\x18\x17\x1f\x1a\x16"'
2 <class 'bytes'> 132656 b'R0lGODlh77+9ACkB77+977+9AAYJDQ8KCBkNDCAOCiwSCyIYFx'
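The \ufffd sequences in the method 1 output suggest that the bytes were already altered before any of the conversions ran. The sketch below reproduces this offline on the first bytes of the PDF. It assumes that the .text attribute decoded the body as UTF-8 and replaced every invalid byte with U+FFFD; that assumption is not confirmed here, and the variable names are only illustrative.

```python
import base64

# First bytes of the PDF, copied from the pygithub output above.
raw = b"%PDF-1.3\r\n%\xe2\xe3\xcf\xd3\r\n"

# Assumption: the body was decoded as UTF-8, with invalid bytes
# replaced by U+FFFD (the Unicode replacement character).
text = raw.decode("utf-8", errors="replace")
replaced = text.count("\ufffd")

# Method 1: every U+FFFD is written out as the six ASCII bytes \ufffd.
method1 = text.encode("raw_unicode_escape")

# Method 2: every U+FFFD becomes three UTF-8 bytes before base64 encoding.
method2 = base64.b64encode(text.encode("utf-8"))

print("replaced bytes:", replaced)
print("raw:", len(raw), "method 1:", len(method1), "method 2:", len(method2))
print("method 1 equals raw:", method1 == raw)
```

Under this assumption the original byte values are gone once the string exists, so no string-to-bytes conversion can bring them back. The sizes reported for the PDF are consistent with exactly four replaced bytes: 3028 + 4 × 5 = 3048 for method 1, and base64 of 3028 + 4 × 2 = 3036 bytes gives 4048 characters for method 2. requests also exposes the undecoded body as Response.content, which may be worth comparing against decoded_content.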