allow users to gather tweets from a shared user timeline #99
Comments
It is indeed quite a drastic change that would need a bit of work and time to integrate nicely. We could optimize the gathering of tweets a little more, but even then, after scaling up to a certain number of accounts you'll eventually run into Twitter's API rate limits again. I'm wondering if, in the meantime, you could mitigate it somewhat by using RSS feeds as the account source for tweets instead of Twitter's API.
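For reference, a minimal sketch of the RSS idea, assuming a Nitter-style feed is available for the account; the feed host, the feedparser dependency, and the field mapping are assumptions for illustration, not part of pleroma-bot:

```python
# Illustrative sketch: pull a user's recent tweets from an RSS feed
# (e.g. a Nitter instance) instead of the Twitter API. The feed host and
# the feedparser dependency are assumptions, not pleroma-bot internals.
import feedparser

def fetch_tweets_from_rss(username, feed_host="https://nitter.net"):
    feed = feedparser.parse(f"{feed_host}/{username}/rss")
    # Each RSS entry maps roughly to one tweet: the title holds the text,
    # the link points back to the original status, published is the timestamp.
    return [
        {"text": entry.title, "url": entry.link, "published": entry.published}
        for entry in feed.entries
    ]

if __name__ == "__main__":
    for tweet in fetch_tweets_from_rss("User1")[:5]:
        print(tweet["published"], tweet["url"])
```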
This project won't be connecting to any protected accounts anyway, and polls may not even matter since there's no way for that data to get back onto Twitter, so those are probably no biggie; I'll look at the RSS option. If the API still supports batching, that would obviously be more desirable in the long term. I haven't seen their API directly in something like 10 years though, so my ideas on how it works are way out of date. (The temporary solution I looked at first is from my other issue today: running multiple bot instances connected to different API apps. It looks like you've already done that one, so yay! :-)
Ugh; it looks like the standard API doesn't have a way to make a batch request. Search might work, but a simpler approach would be:

a) have a global setting in the bot config for a user whose timeline should be scraped for tweets (since "fetch a user timeline" is a single query) before processing the configured users; those tweets would then be broken out by which user tweeted them, if that user matches one in the bot config

b) have a per-user setting in the bot config that says whether to fetch this user's tweets separately or to use tweets from the globally configured timeline

That way I could set up a single Twitter account that follows all the accounts I want batched. It would also maybe minimize the impact on the bot's processing path; it makes me do the work of setting up a timeline to scrape, so the optimization workload is on me instead of you. :-)
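To make the bucketing step concrete, here is a rough sketch of the shared-timeline idea; the function and field names are made up for illustration and are not pleroma-bot's actual code:

```python
# Rough sketch: fetch one shared timeline, then group its tweets by author,
# keeping only authors that are also configured in the bot. The tweet dict
# layout and the fetch_user_timeline helper are hypothetical.
from collections import defaultdict

def split_shared_timeline(timeline_tweets, configured_usernames):
    """Group tweets from a single shared timeline by author."""
    configured = {name.lower() for name in configured_usernames}
    buckets = defaultdict(list)
    for tweet in timeline_tweets:
        author = tweet["author_username"].lower()
        if author in configured:
            buckets[author].append(tweet)
    return buckets

# One API call for the shared timeline replaces one call per configured user:
# timeline = fetch_user_timeline("TwitterUserFollowingAccounts")  # hypothetical
# per_user = split_shared_timeline(timeline, ["User1", "User2"])
```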
Oh, I forgot to mention: if you're gonna try the RSS feature, maybe do so on the latest rc version.
Hmmm, I'm a little torn about this. How do you envision the config looking if we use the timeline approach you suggest? Something along the lines of this?

```yaml
pleroma_base_url: https://pleroma.instance
max_tweets: 40
timeline_user: TwitterUserFollowingAccounts
twitter_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
users:
- twitter_username: User1
  pleroma_username: MyPleromaUser1
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  use_timeline: true
- twitter_username: User2
  pleroma_username: MyPleromaUser2
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```
Yeah, that's pretty much what I was thinking. "shared_timeline_user" and "use_shared_timeline" might be clearer?
Excellent. Yeah, the names are subject to change; I just wanted to make sure I understood what you were going for.

On a related note, I've also been experimenting with guest tokens as another way of circumventing Twitter's API rate limits (and for people who don't want to apply for a dev account). If you have no …

```yaml
pleroma_base_url: https://pleroma.instance
twitter_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
users:
- twitter_username: User1
  pleroma_username: MyPleromaUser1
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
  guest: true # <---
```

It's limited to 20 tweets (or I haven't figured out how to force it to paginate with the cursor yet). You can try it for yourself by installing …
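For context, guest tokens are what Twitter's own web client uses; the activation endpoint is the one that shows up in the error logs further down. A minimal sketch of requesting one (the bearer token is a placeholder you would have to supply yourself; this is not pleroma-bot's actual code):

```python
# Minimal sketch of activating a Twitter guest token, the mechanism behind
# the guest: true option. WEB_CLIENT_BEARER is a placeholder; the web client
# ships its own public bearer token, which is not reproduced here.
import requests

WEB_CLIENT_BEARER = "REPLACE_WITH_WEB_CLIENT_BEARER_TOKEN"

def get_guest_token(session=None):
    session = session or requests.Session()
    resp = session.post(
        "https://api.twitter.com/1.1/guest/activate.json",
        headers={"Authorization": f"Bearer {WEB_CLIENT_BEARER}"},
        timeout=10,
    )
    resp.raise_for_status()  # a 429 here is the rate limit hit later in this thread
    return resp.json()["guest_token"]
```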
If that works, it should do the trick. I converted my config to guest and ran with no rate-limiter hits, though I do seem to have gotten re-posts of recent tweets on some accounts. For example: https://twitter.oksocial.net/loresjoberg (In total, it looks like maybe 10 out of 46 accounts ended up with a re-post.)
That did not go well. :-) I'm running with a script that rebuilds the bot config files, runs the bot, then sleeps 5 minutes. In that setup, it ran one pass successfully as guest, then all subsequent runs got this for all accounts:

Error log

```
ℹ 2022-11-28 10:00:35,631 - pleroma_bot - INFO - ======================================
INFO:pleroma_bot:======================================
ℹ 2022-11-28 10:00:35,631 - pleroma_bot - INFO - Processing user: adamconover
INFO:pleroma_bot:Processing user: adamconover
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): api.twitter.com:443
DEBUG:urllib3.connectionpool:https://api.twitter.com:443 "POST /1.1/guest/activate.json HTTP/1.1" 429 69
✖ 2022-11-28 10:00:35,775 - pleroma_bot - ERROR - Exception occurred for user, skipping... (cli.py:700)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 539, in main
user = User(user_item, config, base_path, posts_ids)
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 205, in __init__
guest_token, headers = self._get_guest_token_header()
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 1085, in _get_guest_token_header
guest_token = json_resp['guest_token']
KeyError: 'guest_token'
ERROR:pleroma_bot:Exception occurred for user, skipping...
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 539, in main
user = User(user_item, config, base_path, posts_ids)
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 205, in __init__
guest_token, headers = self._get_guest_token_header()
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 1085, in _get_guest_token_header
guest_token = json_resp['guest_token']
KeyError: 'guest_token'
```
Looks like it hit a 429 when requesting a guest token. How many users would you say you run it with in the span of 15 min? I may try to replicate it on my side too.
There are 46 accounts on it at the moment, so that'd presumably be 92 to 138 attempts depending on how the timing goes?
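For clarity, the arithmetic behind those numbers, assuming the 5-minute sleep between runs described above (so two or three passes fit in a 15-minute window, with one guest-token request per account per pass):

```python
# 46 accounts, one guest-token activation per account per pass,
# and 2-3 passes per 15-minute window with a 5-minute sleep between runs.
accounts = 46
print([accounts * passes for passes in (2, 3)])  # [92, 138]
```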
After some testing, if I randomize the user agent slightly I'm getting 1000 requests for a new guest token before getting rate limited. In addition to that, I've also added retrying with proxies once you've hit a 429. This really only helps when using guest tokens (with an app token your request count goes up no matter what the source IP happens to be). They are configurable with the proxy_pool setting:

```yaml
proxy_pool:
  - 128.199.221.6:443
  - 164.62.72.90:80
  - 178.128.121.196:443
pleroma_base_url: https://pleroma.instance
users:
- twitter_username: User1
  pleroma_username: MyPleromaUser1
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
- twitter_username: User1
  pleroma_username: MyPleromaUser1
  proxy: false
  pleroma_token: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```

Hopefully that helps alleviate your rate limit issue a bit; these changes are included in …
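As a rough illustration of the retry idea described above (not the bot's actual implementation; only the proxy addresses are taken from the example config, everything else is made up):

```python
# Illustrative sketch: retry a request through a proxy pool after a 429,
# with a lightly randomized User-Agent. Not pleroma-bot's real code.
import random
import requests

PROXY_POOL = ["128.199.221.6:443", "164.62.72.90:80", "178.128.121.196:443"]

def random_user_agent():
    # Varying the build number is enough to look like a slightly different client
    return (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        f"(KHTML, like Gecko) Chrome/108.0.{random.randint(1000, 5000)}.0 Safari/537.36"
    )

def get_with_proxy_fallback(url, **kwargs):
    headers = {"User-Agent": random_user_agent(), **kwargs.pop("headers", {})}
    resp = requests.get(url, headers=headers, **kwargs)
    if resp.status_code != 429:
        return resp
    # Rate limited: retry once through each proxy until one gets through
    for proxy in PROXY_POOL:
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        resp = requests.get(url, headers=headers, proxies=proxies, **kwargs)
        if resp.status_code != 429:
            break
    return resp
```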
Thanks; I'll give that rc a try.
Oh, and by the way: if I had to guess, the re-posts were probably due to some timestamps not being transformed correctly to UTC format. This change is included in …
That didn't work; I left it running unattended for an hour and it doesn't seem to have hit the rate limiter, but it also didn't post anything. When I pulled guest: true out of the config it caught up with what it had missed. This was with rc35, so I'll re-try in a bit with rc37. Thanks.
Timezones are fun. I had to force it to UTC, otherwise it would use the local timezone when parsing the start date into a UTC epoch timestamp. If it still happens on …
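The pitfall is easy to reproduce: a naive datetime is interpreted in the machine's local timezone, so the epoch timestamp it produces shifts with wherever the bot happens to run. A minimal illustration (not the bot's actual parsing code):

```python
# Why the start date has to be forced to UTC: naive datetimes are treated
# as local time, so their epoch timestamps differ between machines.
from datetime import datetime, timezone

naive = datetime.strptime("2022-11-28 10:00:35", "%Y-%m-%d %H:%M:%S")
forced_utc = naive.replace(tzinfo=timezone.utc)

print(naive.timestamp())       # shifts with the machine's local timezone
print(forced_utc.timestamp())  # stable regardless of where the bot runs
```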
rc38 is doing better; it doesn't seem to be missing tweets. I see it doing the rollover to public proxies.

That's a neat feature, but for my project I'm not happy about depending on someone else's proxy; I wouldn't want to cause anyone else trouble. That's my problem to deal with, though. :-)

This seems to be viable for running every 5 minutes at the moment. I do think that batching tweets from a user timeline is a better strategy in the long run, but this fix is working for now. Thanks.
For sure, this was meant just as a stopgap for your use case, because batching and user timelines will take me a while to implement. (And I also happened to be investigating guest tokens anyway, for people who would rather not apply for a dev account.) I still agree the timeline approach is something we want to pursue and would be a nice option when using the bot. I'll change the title of the issue to reflect that, if you're ok with that.

Oh, just a last remark: if you happen to have access to or run private proxies, you can put them into the proxy_pool.
It did just crash out with this error:

Error log

```
✖ 2022-11-29 12:57:33,853 - pleroma_bot - ERROR - Exception occurred for user, skipping... (cli.py:707)
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 125, in process_tweets
_get_rt_media_url(self, tweet, media)
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 264, in _get_rt_media_url
tweet_rt = self._get_tweets("v2", tweet_id)
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 411, in _get_tweets
tweet_id=tweet_id, start_time=start_time, t_user=t_user, pbar=pbar
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 478, in _get_tweets_v2
params=params
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 37, in twitter_api_request
"Rate limit exceeded. 0 out of {} requests remaining until {}"
TypeError: 'list' object is not callable
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 643, in main
tweets, user, threads
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 121, in process_parallel
p.imap_unordered(user.process_tweets, tweets_chunked)
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 735, in next
raise value
TypeError: 'list' object is not callable
ERROR:pleroma_bot:Exception occurred for user, skipping...
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 125, in process_tweets
_get_rt_media_url(self, tweet, media)
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_processing.py", line 264, in _get_rt_media_url
tweet_rt = self._get_tweets("v2", tweet_id)
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 411, in _get_tweets
tweet_id=tweet_id, start_time=start_time, t_user=t_user, pbar=pbar
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 478, in _get_tweets_v2
params=params
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_twitter.py", line 37, in twitter_api_request
"Rate limit exceeded. 0 out of {} requests remaining until {}"
TypeError: 'list' object is not callable
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/cli.py", line 643, in main
tweets, user, threads
File "/usr/local/lib/python3.6/site-packages/pleroma_bot/_utils.py", line 121, in process_parallel
p.imap_unordered(user.process_tweets, tweets_chunked)
File "/usr/lib64/python3.6/multiprocessing/pool.py", line 735, in next
raise value
TypeError: 'list' object is not callable
```

(Occasional crashes don't bother me much, but I figured you'd like to know.)
Ah, of course: the requests using guest tokens don't contain the same rate limiting headers as the proper API when hitting a 429 (for whatever reason). It should be included in …
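A general way to handle that situation, sketched here only to show the idea (this is not the actual fix), is to treat the x-rate-limit-* headers as optional when building the error message:

```python
# Sketch: read Twitter's rate-limit headers defensively, since guest-token
# responses may omit them on a 429. The fallback strings are arbitrary.
from datetime import datetime, timezone

def describe_rate_limit(response):
    headers = response.headers
    remaining = headers.get("x-rate-limit-remaining", "unknown")
    limit = headers.get("x-rate-limit-limit", "unknown")
    reset = headers.get("x-rate-limit-reset")
    reset_at = (
        datetime.fromtimestamp(int(reset), tz=timezone.utc).isoformat()
        if reset else "unknown"
    )
    return (
        f"Rate limit exceeded. {remaining} out of {limit} "
        f"requests remaining until {reset_at}"
    )
```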
OK, this one may be problematic. Feel free to tell me if it isn't workable. :-)
I'm hoping to minimize latency between the time a tweet is posted and the time it's mirrored into a fediverse post, so I'm hitting the 300-requests-per-15-minutes rate limiter pretty often. I'm guessing this is because the bot is requesting tweets from Twitter for each configured account, and I have 45 in there at the moment and would like to be able to support way more than that.
(This is for https://twitter.oksocial.net/about; that page describes the service.)
I'm pretty sure Twitter's API lets you give it a list of accounts to pull tweets from on each request, rather than just a single account, so it should be possible for the bot to batch all the accounts for which it does not have specific login info into a single request?
I suspect that's much more complicated than adding the bot attribute was, but thanks for considering it. :-)
-robin