JustAnotherArchivist
11a82e110a
Remove obsolete comment
...
Cf. f296f9d2
2020-10-16 18:37:51 +00:00
JustAnotherArchivist
16ebe8bf48
Introduce dedicated IntWithGranularity type and deprecate the direct *Granularity fields
2020-10-16 18:20:47 +00:00
JustAnotherArchivist
1bbe25647a
Refactor deprecated properties
2020-10-16 18:11:52 +00:00
JustAnotherArchivist
e22b461563
Add Python 3.9 classifier
2020-10-16 01:27:17 +00:00
JustAnotherArchivist
c4a5715e18
Fix Facebook user and community scrapers
...
Facebook is redirecting the previous user agent to the mobile site; use current Firefox ESR instead.
2020-10-16 01:20:50 +00:00
JustAnotherArchivist
5cb64faa72
Formally deprecate the already deprecated item attributes
2020-10-16 00:55:55 +00:00
JustAnotherArchivist
0f78aa45fc
Refactor --format handling to avoid conversion to dict
2020-10-16 00:55:14 +00:00
JustAnotherArchivist
179112a310
Fix --format
...
Broken by the switch to dataclasses in bd53e729
2020-10-16 00:27:13 +00:00
JustAnotherArchivist
4ce9ed4eb3
Add --progress option that prints a status update every 100 results and at the end
...
Closes #116
2020-10-16 00:00:43 +00:00
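
The behaviour described in the commit above (a status update every 100 results and one at the end) could be sketched roughly as follows. This is an illustration only, not snscrape's actual code: `with_progress`, the `report` callback, and the message wording are all hypothetical.

```python
import sys


def with_progress(items, every=100, report=None):
    """Yield items unchanged, reporting progress every `every` items and at the end.

    Hypothetical helper for illustration; the names and messages are assumptions.
    """
    if report is None:
        # Default: write status lines to stderr so they don't mix with results.
        report = lambda message: print(message, file=sys.stderr)
    count = 0
    for count, item in enumerate(items, start=1):
        yield item
        if count % every == 0:
            report(f'Scraping, {count} results so far')
    report(f'Finished, {count} results')


if __name__ == '__main__':
    for _ in with_progress(range(250)):
        pass
```

Wrapping the result iterator keeps the progress reporting separate from the scraping logic itself.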
JustAnotherArchivist
11414cb68f
Rename cli module to make it clear that it is considered private API
2020-10-15 23:47:07 +00:00
JustAnotherArchivist
bd53e729a0
Replace named tuples with dataclasses and move JSON conversion logic to the base classes
...
Named tuples were never really adequate for this since the order aspect of them doesn't make sense.
Further, named tuples don't support multiple inheritance. This meant that the objects returned by get_items() were not actually Items, for example. Since Python 3.9, such named tuples cannot be created anymore.
Fixes #111
2020-10-15 23:44:28 +00:00
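
A minimal sketch of the idea in the commit above, assuming a simplified base class: a dataclass can inherit from a common base that carries the JSON conversion, so instances really are `Item`s. The names `Item`, `Post`, and the fields are hypothetical and chosen for illustration; the real classes may look different.

```python
import dataclasses
import json


class Item:
    """Hypothetical base class holding the shared JSON conversion logic."""

    def json(self):
        # dataclasses.asdict recurses into nested dataclasses as well.
        return json.dumps(dataclasses.asdict(self))


@dataclasses.dataclass
class Post(Item):
    """Hypothetical item type; the fields are assumptions for the example."""
    url: str
    content: str


if __name__ == '__main__':
    post = Post(url='https://example.org/1', content='hi')
    print(isinstance(post, Item))
    print(post.json())
```

Unlike a named tuple, the dataclass has no positional ordering semantics, and the inheritance relationship is an ordinary one.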
JustAnotherArchivist
ffd9289edc
Reduce the logging level of retryable retrieval errors from WARNING to INFO
...
There is no real need to report these as WARNINGs as snscrape tries and in most cases manages to recover. Without --verbose, snscrape's output can be confusing (see #76). If the retries fail as well, snscrape will still log that as an ERROR and crash loudly.
2020-10-11 22:29:27 +00:00
JustAnotherArchivist
b1a7b9607f
Skip individual Telegram photo/video links
2020-10-07 01:27:26 +00:00
JustAnotherArchivist
119e53d07c
Fix Telegram post URL extraction
2020-10-07 01:15:51 +00:00
JustAnotherArchivist
c3e2e12369
Deprecate outlinksss
2020-10-01 22:00:26 +00:00
JustAnotherArchivist
a70b361176
Use more assignment expressions where appropriate
2020-10-01 21:45:25 +00:00
JustAnotherArchivist
8b68f1a8af
Fix link previews for pure-image previews
...
... and any other preview that doesn't have all the things for some reason.
2020-10-01 18:56:55 +00:00
JustAnotherArchivist
c72bf3174f
Use assignment expressions for cleaner code
2020-10-01 18:54:57 +00:00
JustAnotherArchivist
472cef2382
Add support for link previews
2020-10-01 18:51:14 +00:00
JustAnotherArchivist
b1d8475a03
Fix link extraction on Telegram
2020-10-01 18:29:08 +00:00
JustAnotherArchivist
3d3faf80bf
Add python_requires to make it even clearer that 3.8+ is required
2020-09-26 16:32:00 +00:00
JustAnotherArchivist
bbb372284b
Bump Python version in README
2020-09-26 15:56:55 +00:00
JustAnotherArchivist
8cf81e9bfc
Fix twitter-profile scraper
...
The Twitter API returns different data structures there, leading to a variety of errors.
2020-09-25 02:45:07 +00:00
JustAnotherArchivist
d90f06b389
Extract more information on users from Twitter
...
Closes #78
2020-09-24 18:39:32 +00:00
JustAnotherArchivist
c519832755
Clarify twitter-list-posts argument value
2020-09-24 18:37:37 +00:00
JustAnotherArchivist
397a0b988e
Remove Twitter list member scraper
...
It has been broken for a while. Member lists were removed from the old design, and they're behind a login wall on the new design.
2020-09-24 18:34:15 +00:00
JustAnotherArchivist
f1428fa0e0
Fix crash on nested quoted tweets
2020-09-24 02:45:49 +00:00
JustAnotherArchivist
7d2c546ee5
Deprecate hacky fields in Tweet objects
2020-09-24 02:00:45 +00:00
JustAnotherArchivist
2332c30e26
Replace locale-dependent strptime date parsing with email.utils.parsedate_to_datetime
2020-09-24 02:00:21 +00:00
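
For context on the commit above: `strptime` with directives such as `%a` and `%b` matches day and month names in the current locale, whereas `email.utils.parsedate_to_datetime` from the standard library always expects the English names of RFC 2822-style dates. A small sketch follows; the sample date string and the `parse_date` wrapper are assumptions for illustration, not taken from snscrape.

```python
import datetime
import email.utils


def parse_date(value):
    """Parse an RFC 2822-style date string independently of the process locale."""
    return email.utils.parsedate_to_datetime(value)


if __name__ == '__main__':
    # Assumed sample input; a '+0000' offset yields a timezone-aware UTC datetime.
    parsed = parse_date('Wed, 10 Oct 2018 20:19:24 +0000')
    print(parsed.isoformat())
```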
JustAnotherArchivist
b78bf3e642
Fix crash on banner-less profiles and nested descriptionUrls
2020-09-24 01:58:38 +00:00
JustAnotherArchivist
1a09f9b9a3
Extract more information from Twitter
...
Including: reply/retweet/like/quote counts, media (photos, videos, and GIFs), full user object, quoted tweets, mentioned users, rendered content, conversation ID, language, source
2020-09-24 01:45:08 +00:00
JustAnotherArchivist
5ae5ec7bcd
Bump Python version classifier
...
Python 3.8 is required since commit 1a2e367a.
2020-09-23 22:25:38 +00:00
JustAnotherArchivist
c0ff6631aa
Update README
2020-09-22 22:30:08 +00:00
JustAnotherArchivist
ae60a4d0fd
Add Weibo scraper
...
Closes #52
2020-09-13 02:27:35 +00:00
JustAnotherArchivist
800cfd5be0
Add support for disabling following redirects
2020-09-13 00:52:26 +00:00
JustAnotherArchivist
f296f9d21d
Refactor post extraction of VK again to work around their weird behaviours
...
VK doesn't always return posts in chronological order, so that can't be used to filter out duplicates. Instead, remember the last 1k post IDs and filter using that. This should catch the vast majority of duplicates. (Also, duplicates don't only happen in the geoblocking workaround; sometimes, VK simply returns the same post again for no obvious reason.)
2020-09-12 02:00:50 +00:00
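
The approach described in the commit above, remembering a bounded window of recent post IDs and skipping anything already in it, might look roughly like this. It is a sketch under assumptions, not the scraper's actual code: `dedupe_recent`, `key`, and `memory` are hypothetical names, and pairing a deque with a set is one possible way to implement the window.

```python
import collections


def dedupe_recent(posts, key=lambda post: post, memory=1000):
    """Yield posts whose ID was not among the last `memory` distinct IDs seen."""
    recent = collections.deque(maxlen=memory)  # remembers insertion order for eviction
    seen = set()  # same IDs as `recent`, for O(1) membership tests
    for post in posts:
        post_id = key(post)
        if post_id in seen:
            continue  # duplicate within the remembered window
        if len(recent) == memory:
            # The oldest ID is about to fall out of the deque; forget it too.
            seen.discard(recent[0])
        recent.append(post_id)
        seen.add(post_id)
        yield post


if __name__ == '__main__':
    print(list(dedupe_recent([1, 2, 2, 3, 1, 4])))
```

Because the window is bounded, a duplicate that reappears more than `memory` distinct posts later would slip through, which matches the commit's "vast majority" wording rather than a hard guarantee.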
JustAnotherArchivist
8265ffc19e
Work around geoblocked posts on VK
...
To get around the block, try to iterate over post offsets individually instead of in 10-steps. This means we should get every post that isn't blocked as long as there are at least 10 posts between two blocked ones.
Fixes #68
2020-09-12 02:00:26 +00:00
JustAnotherArchivist
f8efe98608
Fix post order on VK: reinsert pinned post at the correct location in the stream
2020-09-12 00:03:29 +00:00
JustAnotherArchivist
2b5444f89e
Restrict --max-results to zero or positive values; use zero to indicate fetching only the entity
2020-09-11 15:37:22 +00:00
JustAnotherArchivist
07d446fd19
Fix crash in VK scraper
2020-09-10 21:05:03 +00:00
JustAnotherArchivist
a25426043b
Fix Telegram username canonicalisation
2020-09-09 09:33:57 +00:00
JustAnotherArchivist
84692846b9
Fix crash in Telegram scraper
2020-09-09 09:22:00 +00:00
JustAnotherArchivist
039b2c6719
Restructure Twitter classes since the 'common' scraper is now only used for the old design
2020-09-07 02:38:27 +00:00
JustAnotherArchivist
70a3d9ba3a
Fix infinite loop at the end of profile pages
2020-09-01 04:01:27 +00:00
JustAnotherArchivist
bd619bf4e9
Log and ignore tweets which are not contained in the globalObjects
...
Fixes #61
2020-09-01 03:45:23 +00:00
JustAnotherArchivist
072519f539
Fix pagination on profile pages
2020-09-01 03:23:45 +00:00
JustAnotherArchivist
d9572ec450
Correctly serialise nested NamedTuples
2020-09-01 03:16:25 +00:00
JustAnotherArchivist
ba250aabf2
Extract retweeted tweet if present
2020-09-01 03:15:21 +00:00
JustAnotherArchivist
0cc4f0c016
Add support for Twitter profile pages
...
Closes #5
2020-09-01 03:13:49 +00:00
JustAnotherArchivist
1a2e367a87
Cache entities
2020-09-01 02:34:21 +00:00