JustAnotherArchivist
c0ff6631aa
Update README
2020-09-22 22:30:08 +00:00
JustAnotherArchivist
ae60a4d0fd
Add Weibo scraper
...
Closes #52
2020-09-13 02:27:35 +00:00
JustAnotherArchivist
800cfd5be0
Add support for disabling following redirects
2020-09-13 00:52:26 +00:00
JustAnotherArchivist
f296f9d21d
Refactor post extraction of VK again to work around their weird behaviours
...
VK doesn't always return posts in chronological order, so ordering can't be used to filter out duplicates. Instead, remember the last 1k post IDs and filter using those. This should catch the vast majority of duplicates. (Also, duplicates don't only happen in the geoblocking workaround; sometimes, VK simply returns the same post again for no obvious reason.)
2020-09-12 02:00:50 +00:00
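The duplicate filter described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the function name, the (post_id, post) pair shape, and the max_remembered parameter are all assumptions made for the example.

```python
import collections


def filter_recent_duplicates(posts, max_remembered=1000):
    """Yield (post_id, post) pairs, skipping IDs seen among the last
    `max_remembered` distinct IDs. Hypothetical helper for illustration."""
    recent_queue = collections.deque()  # IDs in the order they were first seen
    recent_set = set()                  # same IDs, for constant-time lookup
    for post_id, post in posts:
        if post_id in recent_set:
            continue  # duplicate within the remembered window
        recent_queue.append(post_id)
        recent_set.add(post_id)
        if len(recent_queue) > max_remembered:
            # Forget the oldest ID so memory use stays bounded.
            recent_set.discard(recent_queue.popleft())
        yield post_id, post
```

Because only a bounded window is remembered, a duplicate that reappears more than max_remembered posts later would slip through, which matches the commit's "vast majority" wording rather than a guarantee.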
JustAnotherArchivist
8265ffc19e
Work around geoblocked posts on VK
...
To get around the block, try to iterate over post offsets individually instead of in steps of 10. This means we should get every post that isn't blocked as long as there are at least 10 posts between two blocked ones.
Fixes #68
2020-09-12 02:00:26 +00:00
JustAnotherArchivist
f8efe98608
Fix post order on VK: reinsert pinned post at the correct location in the stream
2020-09-12 00:03:29 +00:00
JustAnotherArchivist
2b5444f89e
Restrict --max-results to zero or positive values; use zero to indicate fetching only the entity
2020-09-11 15:37:22 +00:00
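One way such a restriction can be expressed with argparse is a custom type function. This is only a sketch under assumptions: non_negative_int and build_parser are hypothetical names, and the actual CLI code may be structured differently.

```python
import argparse


def non_negative_int(value):
    """argparse type function accepting zero or a positive integer.
    Hypothetical helper; a ValueError from int() is reported by argparse."""
    number = int(value)
    if number < 0:
        raise argparse.ArgumentTypeError(
            f'{value!r} is not zero or a positive integer')
    return number


def build_parser():
    # Illustrative parser with only the one option discussed above.
    parser = argparse.ArgumentParser()
    parser.add_argument('--max-results', type=non_negative_int, default=None)
    return parser
```

With this, `--max-results 0` parses to 0, while a negative value makes argparse exit with a usage error.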
JustAnotherArchivist
07d446fd19
Fix crash in VK scraper
2020-09-10 21:05:03 +00:00
JustAnotherArchivist
a25426043b
Fix Telegram username canonicalisation
2020-09-09 09:33:57 +00:00
JustAnotherArchivist
84692846b9
Fix crash in Telegram scraper
2020-09-09 09:22:00 +00:00
JustAnotherArchivist
039b2c6719
Restructure Twitter classes since the 'common' scraper is now only used for the old design
2020-09-07 02:38:27 +00:00
JustAnotherArchivist
70a3d9ba3a
Fix infinite loop at the end of profile pages
2020-09-01 04:01:27 +00:00
JustAnotherArchivist
bd619bf4e9
Log and ignore tweets which are not contained in the globalObjects
...
Fixes #61
2020-09-01 03:45:23 +00:00
JustAnotherArchivist
072519f539
Fix pagination on profile pages
2020-09-01 03:23:45 +00:00
JustAnotherArchivist
d9572ec450
Correctly serialise nested NamedTuples
2020-09-01 03:16:25 +00:00
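For context on the nested NamedTuple commit above: json.dumps treats a NamedTuple as a plain tuple, so field names are lost unless the value is converted first. A rough sketch of a recursive conversion follows; Author, Item, and to_plain are hypothetical names for illustration and are not taken from the project.

```python
import json
import typing


class Author(typing.NamedTuple):
    username: str


class Item(typing.NamedTuple):
    id: int
    author: Author


def to_plain(value):
    """Recursively convert NamedTuples to dicts so field names survive
    JSON encoding. Hypothetical helper for illustration."""
    if isinstance(value, tuple) and hasattr(value, '_asdict'):
        return {key: to_plain(item) for key, item in value._asdict().items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    if isinstance(value, dict):
        return {key: to_plain(item) for key, item in value.items()}
    return value


encoded = json.dumps(to_plain(Item(id=1, author=Author(username='example'))))
```

Calling _asdict() only on the outer object would leave the inner Author as a tuple, which is why the conversion recurses.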
JustAnotherArchivist
ba250aabf2
Extract retweeted tweet if present
2020-09-01 03:15:21 +00:00
JustAnotherArchivist
0cc4f0c016
Add support for Twitter profile pages
...
Closes #5
2020-09-01 03:13:49 +00:00
JustAnotherArchivist
1a2e367a87
Cache entities
2020-09-01 02:34:21 +00:00
JustAnotherArchivist
4f24843f89
Extract user ID
2020-09-01 02:26:13 +00:00
JustAnotherArchivist
bfb92a47b9
Move Tweet object generation to TwitterAPIScraper
2020-09-01 02:25:00 +00:00
JustAnotherArchivist
dc5d55004b
Refactor API interaction into something cleaner and more reusable
2020-09-01 01:56:07 +00:00
JustAnotherArchivist
d8e7f96d4d
Add support for Reddit
...
Closes #15
2020-08-31 03:38:20 +00:00
JustAnotherArchivist
bb83d1d72f
Validate Twitter usernames
...
Closes #55
2020-08-24 19:03:52 +00:00
JustAnotherArchivist
1480260e47
Handle Telegram channels without public posts
2020-08-24 17:54:30 +00:00
JustAnotherArchivist
c8d688d39f
Fix crash on Telegram pages without a description
2020-08-24 17:53:50 +00:00
JustAnotherArchivist
9df4352089
Fix crash on VK pages without an info div
2020-08-24 17:42:33 +00:00
JustAnotherArchivist
dd25fd0526
Add support for extracting the entity behind a scrape
...
Closes #11
Backwards incompatibility: snscrape.modules.twitter.Account is now called User. However, this was previously only used on the list member scraper, which has been broken for a while since the list member list is no longer publicly accessible.
For compatibility reasons, the CLI does not output the entity by default; the new option --with-entity enables it.
2020-08-24 01:38:27 +00:00
JustAnotherArchivist
c90fd54b6b
Make datetime.date serialisable
2020-08-24 01:12:38 +00:00
JustAnotherArchivist
9528df48cd
Refactor base URL handling
2020-08-24 01:12:06 +00:00
JustAnotherArchivist
924c35f883
Refactor guest token extraction code
2020-08-22 22:59:43 +00:00
JustAnotherArchivist
588ec415ff
Force TwitterThreadScraper to fetch the old design (take 2)
2020-08-12 17:19:42 +00:00
JustAnotherArchivist
bf229414ba
Add JSONL output format
2020-08-12 15:09:02 +00:00
JustAnotherArchivist
afa819547d
Update README
2020-08-11 22:18:04 +00:00
JustAnotherArchivist
dbcdc159ef
Add support for scraping Facebook page visitor posts aka 'Community'
...
Closes #18
2020-08-11 22:14:27 +00:00
JustAnotherArchivist
30f945897a
Clean Facebook group post URLs
...
Most of the time, the URLs are already clean, but occasionally, Facebook includes tracking parameters (__xts__[0], __tn__)...
2020-08-11 20:48:14 +00:00
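Stripping the tracking parameters named in the commit above could look roughly like this. It is a sketch using only the standard library; strip_tracking_parameters is a hypothetical name, the example.org URL is a placeholder, and the commit's actual approach may differ.

```python
import urllib.parse

# Parameter names taken from the commit message; indexed forms such as
# __xts__[0] are matched by their base name.
TRACKING_PARAMETERS = {'__xts__', '__tn__'}


def strip_tracking_parameters(url):
    """Return url without the tracking query parameters listed above.
    Hypothetical helper for illustration."""
    parts = urllib.parse.urlsplit(url)
    query = urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
    kept = [(key, value) for key, value in query
            if key.split('[', 1)[0] not in TRACKING_PARAMETERS]
    return urllib.parse.urlunsplit(
        parts._replace(query=urllib.parse.urlencode(kept)))


clean = strip_tracking_parameters(
    'https://example.org/groups/1/permalink/2/?__xts__[0]=68.abc&__tn__=-R')
```

URLs that are already clean pass through unchanged apart from any re-encoding of the remaining query string.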
JustAnotherArchivist
eee5794ff9
Extract Facebook group posts in chronological order (instead of by last comment)
...
Fixes #66
2020-08-11 20:47:42 +00:00
JustAnotherArchivist
966a6ebd8e
Skip promoted tweets/ads
...
Fixes #67
2020-08-11 20:28:35 +00:00
JustAnotherArchivist
4d3d0fe0d7
Update search API parameter values to the ones currently used on Twitter
2020-08-11 20:26:56 +00:00
JustAnotherArchivist
7b967ff82a
Twitter reverted their guest token change (90f9598e)
v0.3.4
2020-07-08 22:07:18 +00:00
JustAnotherArchivist
90f9598ecc
Adjust to Twitter's new method of handing out guest tokens
...
Fixes #64
v0.3.3
2020-06-24 21:22:58 +00:00
JustAnotherArchivist
7b3c7deb28
Catch login redirects on Instagram
v0.3.2
2020-05-30 00:56:34 +00:00
JustAnotherArchivist
040a11656c
Update README
2020-05-30 00:53:52 +00:00
JustAnotherArchivist
1459245258
Consistently raise ScraperException on fatal errors
2020-05-30 00:53:49 +00:00
JustAnotherArchivist
dbe4c5ce55
Remove Google+ module
...
Google+ was mostly shut down in early 2019. What remained (Google+ for G Suite) was renamed to Google Currents and is for internal communication only (and therefore out of scope for snscrape).
2020-05-30 00:35:06 +00:00
JustAnotherArchivist
80491ecc2c
Remove Gab module
...
Since Gab's move to a fork of Mastodon in July 2019, the module had been broken, and a new module would be better written from scratch, as the platform changed entirely.
2020-05-30 00:23:33 +00:00
JustAnotherArchivist
1a71b58101
Add support for Telegram
...
Closes #50
2020-05-29 23:44:01 +00:00
JustAnotherArchivist
0ce37a69d4
Log exception details on crashes
2020-05-29 22:29:23 +00:00
JustAnotherArchivist
722bfd5f7c
Handle Twitter tombstones
...
Fixes #63
2020-05-29 22:12:37 +00:00
JustAnotherArchivist
b6cc3180d9
Force TwitterThreadScraper and TwitterListMembersScraper to fetch the old design
v0.3.1
2020-03-04 00:40:49 +00:00
JustAnotherArchivist
613395d1c2
Port TwitterSearchScraper to redesign
...
Fixes #57
2020-03-04 00:40:49 +00:00