355 Commits

Author SHA1 Message Date
JustAnotherArchivist
c72bf3174f Use assignment expressions for cleaner code 2020-10-01 18:54:57 +00:00
JustAnotherArchivist
472cef2382 Add support for link previews 2020-10-01 18:51:14 +00:00
JustAnotherArchivist
b1d8475a03 Fix link extraction on Telegram 2020-10-01 18:29:08 +00:00
JustAnotherArchivist
3d3faf80bf Add python_requires to make it even clearer that 3.8+ is required 2020-09-26 16:32:00 +00:00
JustAnotherArchivist
bbb372284b Bump Python version in README 2020-09-26 15:56:55 +00:00
JustAnotherArchivist
8cf81e9bfc Fix twitter-profile scraper
The Twitter API returns different data structures there, leading to a variety of errors.
2020-09-25 02:45:07 +00:00
JustAnotherArchivist
d90f06b389 Extract more information on users from Twitter
Closes #78
2020-09-24 18:39:32 +00:00
JustAnotherArchivist
c519832755 Clarify twitter-list-posts argument value 2020-09-24 18:37:37 +00:00
JustAnotherArchivist
397a0b988e Remove Twitter list member scraper
It has been broken for a while. Member lists were removed from the old design, and they're behind a login wall on the new design.
2020-09-24 18:34:15 +00:00
JustAnotherArchivist
f1428fa0e0 Fix crash on nested quoted tweets 2020-09-24 02:45:49 +00:00
JustAnotherArchivist
7d2c546ee5 Deprecate hacky fields in Tweet objects 2020-09-24 02:00:45 +00:00
JustAnotherArchivist
2332c30e26 Replace locale-dependent strptime date parsing with email.utils.parsedate_to_datetime 2020-09-24 02:00:21 +00:00
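A minimal sketch of the parsing approach named in 2332c30e26, not the project's actual code. `strptime` matches `%a` and `%b` against the day and month names of the active locale, whereas `email.utils.parsedate_to_datetime` always expects the English abbreviations. The date string is an illustrative RFC 2822-style value, not real API output, and `parse_timestamp` is a hypothetical helper name.

```python
import datetime
import email.utils


def parse_timestamp(value: str) -> datetime.datetime:
    # Hypothetical helper: parses an RFC 2822-style date string.
    # The day and month names are matched in English regardless of the process locale.
    return email.utils.parsedate_to_datetime(value)


# Illustrative input; a '+0000' offset yields a timezone-aware datetime.
parsed = parse_timestamp('Thu, 24 Sep 2020 02:00:21 +0000')
print(parsed.isoformat())
```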
JustAnotherArchivist
b78bf3e642 Fix crash on banner-less profiles and nested descriptionUrls 2020-09-24 01:58:38 +00:00
JustAnotherArchivist
1a09f9b9a3 Extract more information from Twitter
Including: reply/retweet/like/quote counts, media (photos, videos, and GIFs), full user object, quoted tweets, mentioned users, rendered content, conversation ID, language, source
2020-09-24 01:45:08 +00:00
JustAnotherArchivist
5ae5ec7bcd Bump Python version classifier
Python 3.8 is required since commit 1a2e367a.
2020-09-23 22:25:38 +00:00
JustAnotherArchivist
c0ff6631aa Update README 2020-09-22 22:30:08 +00:00
JustAnotherArchivist
ae60a4d0fd Add Weibo scraper
Closes #52
2020-09-13 02:27:35 +00:00
JustAnotherArchivist
800cfd5be0 Add support for disabling following redirects 2020-09-13 00:52:26 +00:00
JustAnotherArchivist
f296f9d21d Refactor post extraction of VK again to work around their weird behaviours
VK doesn't always return posts in chronological order, so that can't be used to filter out duplicates. Instead, remember the last 1k post IDs and filter using that. This should catch the vast majority of duplicates. (Also, duplicates don't only happen in the geoblocking workaround; sometimes, VK also simply returns the same post again for no obvious reason.)
2020-09-12 02:00:50 +00:00
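The duplicate filter described in f296f9d21d (remember the last 1k post IDs and skip anything already in that window) could look roughly like the sketch below. This is an assumption about the general technique, not the code from the commit. The function name `filter_recent_duplicates` and the `memory` parameter are hypothetical.

```python
import collections
from typing import Hashable, Iterable, Iterator


def filter_recent_duplicates(post_ids: Iterable[Hashable], memory: int = 1000) -> Iterator[Hashable]:
    # Hypothetical helper: yields each ID unless it is among the last `memory` IDs yielded.
    if memory < 1:
        raise ValueError('memory must be at least 1')
    recent = collections.deque()  # IDs in the order they were yielded, oldest first
    seen = set()                  # same IDs, for constant-time membership checks
    for post_id in post_ids:
        if post_id in seen:
            continue  # duplicate within the window: skip it
        if len(recent) == memory:
            # Window is full: forget the oldest ID before remembering the new one
            seen.discard(recent.popleft())
        recent.append(post_id)
        seen.add(post_id)
        yield post_id


# Order of arrival does not matter, only whether the ID was seen recently.
deduplicated = list(filter_recent_duplicates([3, 1, 3, 2, 1, 4]))
```

Because the window is bounded, a duplicate that reappears after more than `memory` other posts would slip through, which matches the commit's wording that this catches the vast majority of duplicates rather than all of them.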
JustAnotherArchivist
8265ffc19e Work around geoblocked posts on VK
To get around the block, try to iterate over post offsets individually instead of in steps of 10. This means we should get every post that isn't blocked as long as there are at least 10 posts between two blocked ones.

Fixes #68
2020-09-12 02:00:26 +00:00
JustAnotherArchivist
f8efe98608 Fix post order on VK: reinsert pinned post at the correct location in the stream 2020-09-12 00:03:29 +00:00
JustAnotherArchivist
2b5444f89e Restrict --max-results to zero or positive values; use zero to indicate fetching only the entity 2020-09-11 15:37:22 +00:00
JustAnotherArchivist
07d446fd19 Fix crash in VK scraper 2020-09-10 21:05:03 +00:00
JustAnotherArchivist
a25426043b Fix Telegram username canonicalisation 2020-09-09 09:33:57 +00:00
JustAnotherArchivist
84692846b9 Fix crash in Telegram scraper 2020-09-09 09:22:00 +00:00
JustAnotherArchivist
039b2c6719 Restructure Twitter classes since the 'common' scraper is now only used for the old design 2020-09-07 02:38:27 +00:00
JustAnotherArchivist
70a3d9ba3a Fix infinite loop at the end of profile pages 2020-09-01 04:01:27 +00:00
JustAnotherArchivist
bd619bf4e9 Log and ignore tweets which are not contained in the globalObjects
Fixes #61
2020-09-01 03:45:23 +00:00
JustAnotherArchivist
072519f539 Fix pagination on profile pages 2020-09-01 03:23:45 +00:00
JustAnotherArchivist
d9572ec450 Correctly serialise nested NamedTuples 2020-09-01 03:16:25 +00:00
JustAnotherArchivist
ba250aabf2 Extract retweeted tweet if present 2020-09-01 03:15:21 +00:00
JustAnotherArchivist
0cc4f0c016 Add support for Twitter profile pages
Closes #5
2020-09-01 03:13:49 +00:00
JustAnotherArchivist
1a2e367a87 Cache entities 2020-09-01 02:34:21 +00:00
JustAnotherArchivist
4f24843f89 Extract user ID 2020-09-01 02:26:13 +00:00
JustAnotherArchivist
bfb92a47b9 Move Tweet object generation to TwitterAPIScraper 2020-09-01 02:25:00 +00:00
JustAnotherArchivist
dc5d55004b Refactor API interaction into something cleaner and more reusable 2020-09-01 01:56:07 +00:00
JustAnotherArchivist
d8e7f96d4d Add support for Reddit
Closes #15
2020-08-31 03:38:20 +00:00
JustAnotherArchivist
bb83d1d72f Validate Twitter usernames
Closes #55
2020-08-24 19:03:52 +00:00
JustAnotherArchivist
1480260e47 Handle Telegram channels without public posts 2020-08-24 17:54:30 +00:00
JustAnotherArchivist
c8d688d39f Fix crash on Telegram pages without a description 2020-08-24 17:53:50 +00:00
JustAnotherArchivist
9df4352089 Fix crash on VK pages without an info div 2020-08-24 17:42:33 +00:00
JustAnotherArchivist
dd25fd0526 Add support for extracting the entity behind a scrape
Closes #11

Backwards incompatibility: snscrape.modules.twitter.Account is now called User. However, this was previously only used on the list member scraper, which has been broken for a while since the list member list is no longer publicly accessible.

For compatibility reasons, the CLI does not output the entity by default; the new option --with-entity enables it.
2020-08-24 01:38:27 +00:00
JustAnotherArchivist
c90fd54b6b Make datetime.date serialisable 2020-08-24 01:12:38 +00:00
JustAnotherArchivist
9528df48cd Refactor base URL handling 2020-08-24 01:12:06 +00:00
JustAnotherArchivist
924c35f883 Refactor guest token extraction code 2020-08-22 22:59:43 +00:00
JustAnotherArchivist
588ec415ff Force TwitterThreadScraper to fetch the old design (take 2) 2020-08-12 17:19:42 +00:00
JustAnotherArchivist
bf229414ba Add JSONL output format 2020-08-12 15:09:02 +00:00
JustAnotherArchivist
afa819547d Update README 2020-08-11 22:18:04 +00:00
JustAnotherArchivist
dbcdc159ef Add support for scraping Facebook page visitor posts aka 'Community'
Closes #18
2020-08-11 22:14:27 +00:00
JustAnotherArchivist
30f945897a Clean Facebook group post URLs
Most of the time, the URLs are already clean, but occasionally, Facebook includes tracking parameters (__xts__[0], __tn__)...
2020-08-11 20:48:14 +00:00
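As a rough illustration of the URL cleaning mentioned in 30f945897a, the sketch below drops the two tracking parameters the commit message names. It is an assumption about one way to do this with the standard library, not the commit's implementation. `TRACKING_PARAMS`, `clean_url`, and the example URL are hypothetical, and the commit message's list of parameters is elided, so the set here may be incomplete.

```python
import urllib.parse

# Only the parameters named in the commit message; the real list may be longer.
TRACKING_PARAMS = {'__xts__[0]', '__tn__'}


def clean_url(url: str) -> str:
    # Hypothetical helper: returns the URL without the known tracking parameters.
    parts = urllib.parse.urlsplit(url)
    # parse_qsl decodes percent-encoded keys, so '__xts__%5B0%5D' becomes '__xts__[0]'
    query = urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in query if key not in TRACKING_PARAMS]
    return urllib.parse.urlunsplit(parts._replace(query=urllib.parse.urlencode(kept)))


# Illustrative URL, not taken from the project
cleaned = clean_url('https://www.facebook.com/groups/example/permalink/123/?__tn__=-R')
```

URLs that carry none of these parameters come back with an equivalent query string, so the function is safe to apply to every extracted link.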