Judiiii
2021-01-24 21:03

Question about hydrating tweet IDs

50
  • python
  • twitter

I'm a complete beginner working on my graduation project, and I've downloaded a set of tweet IDs.

The GitHub repository is here: https://github.com/echen102/us-pres-elections-2020

It includes a hydrate.py script. When I run it, I get the following errors:

ERROR:twarc:caught connection error HTTPSConnectionPool(host='api.twitter.com', port=443): Max retries exceeded with url: /1.1/account/verify_credentials.json?tweet_mode=extended (Caused by ConnectTimeoutError(<urllib3.connection.VerifiedHTTPSConnection object at 0x00000204037909E8>, 'Connection to api.twitter.com timed out. (connect timeout=3.05)'))

WARNING:twarc:caught read timeout: HTTPSConnectionPool(host='api.twitter.com', port=443): Read timed out. (read timeout=3.05)
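Both messages are timeouts while connecting to or reading from api.twitter.com, so one way to narrow things down is to test whether that host is reachable at all from the same machine, independent of twarc. The following is a minimal sketch using only the standard library. The helper name `can_connect` and the 5-second timeout are illustrative choices, not part of the repository or of twarc.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper for diagnosis only; it does not use twarc or any API keys.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # OSError covers timeouts, refused connections, and DNS lookup failures.
        return False
```

Calling `can_connect("api.twitter.com")` from the machine that runs hydrate.py returns False if the host cannot be reached. In that case the cause is more likely the network path (firewall, proxy, or VPN settings) than the script itself.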

#!/usr/bin/env python3

#
# This script will walk through all the tweet id files and
# hydrate them with twarc. The line oriented JSON files will
# be placed right next to each tweet id file.
#
# Note: you will need to install twarc, tqdm, and run twarc configure
# from the command line to tell it your Twitter API keys.
#
# Special thanks to Github users edsu and SamSamhuns for contributing to this file. This file was repurposed from our other
# data repository on COVID-19 related tweets : https://github.com/echen102/COVID-19-TweetIDs
#

import gzip
import json

from tqdm import tqdm
from twarc import Twarc
from pathlib import Path

twarc = Twarc()
data_dirs = [ '2020-09', '2020-10', '2020-11', '2020-12', '2021-01']


def main():
    for data_dir in data_dirs:
        for path in Path(data_dir).iterdir():
            if path.name.endswith('.txt'):
                hydrate(path)


def _reader_generator(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024 * 1024)


def raw_newline_count(fname):
    """
    Counts number of lines in file
    """
    # Open in binary mode and make sure the handle is closed afterwards
    with open(fname, 'rb') as f:
        f_gen = _reader_generator(f.raw.read)
        return sum(buf.count(b'\n') for buf in f_gen)


def hydrate(id_file):
    print('hydrating {}'.format(id_file))

    gzip_path = id_file.with_suffix('.jsonl.gz')
    if gzip_path.is_file():
        print('skipping json file already exists: {}'.format(gzip_path))
        return

    num_ids = raw_newline_count(id_file)

    # Open the output and the id file together so both are closed on exit
    with gzip.open(gzip_path, 'w') as output, id_file.open() as ids:
        with tqdm(total=num_ids) as pbar:
            for tweet in twarc.hydrate(ids):
                output.write(json.dumps(tweet).encode('utf8') + b"\n")
                pbar.update(1)


if __name__ == "__main__":
    main()
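Once hydration succeeds, the script leaves a gzip-compressed, line-oriented JSON file next to each id file. A short sketch of reading such a file back is shown here, assuming one JSON object per line encoded as UTF-8, which is what `hydrate()` above writes. The function name `read_hydrated` is hypothetical and not part of the repository.

```python
import gzip
import json


def read_hydrated(gzip_path):
    """Yield one tweet dict per line from a .jsonl.gz file written by hydrate()."""
    # 'rt' decompresses and decodes to text in one step
    with gzip.open(gzip_path, 'rt', encoding='utf8') as lines:
        for line in lines:
            line = line.strip()
            if line:  # skip any blank lines
                yield json.loads(line)
```

Because it is a generator, `read_hydrated` can be iterated over directly in a `for` loop, so a large file does not have to be loaded into memory at once.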

 


 

4 answers