188 Commits

Author SHA1 Message Date
JustAnotherArchivist
7499384110 Merge pull request #131 from gitshrl/facebook/fix-group-pagination
Fix pagination error for Facebook group scraper
2020-10-21 15:08:50 +00:00
sahrul
7a0f68b7ec fix pagination for facebook group scraper 2020-10-21 21:30:00 +07:00
JustAnotherArchivist
1a219fd2b6 Merge pull request #129 from gitshrl/facebook/fix-group-scraper
Update base URL for Facebook group scraper
2020-10-21 14:03:59 +00:00
sahrul
6fb98dae12 update base url for facebook group scraper 2020-10-21 19:57:02 +07:00
JustAnotherArchivist
8c2c0fa47a Remove workaround for http://bugs.python.org/issue16308 as snscrape requires 3.8+ now anyway 2020-10-18 20:25:54 +00:00
JustAnotherArchivist
58c8365c33 Add test extra requirements 2020-10-18 20:03:29 +00:00
JustAnotherArchivist
2c11ec38fa Replace requests.models with plain requests
requests.models is all but undocumented, and the three types needed here are all in the requests namespace as well.
2020-10-18 02:35:55 +00:00
JustAnotherArchivist
fe5e23502d collections.deque support and other minor improvements to snscrape._cli._repr 2020-10-18 02:12:09 +00:00
JustAnotherArchivist
644cd1d2fb Add support for various further complicated types to snscrape._cli._repr 2020-10-18 01:42:45 +00:00
JustAnotherArchivist
5ccfab6314 Add .gitignore 2020-10-18 01:14:04 +00:00
JustAnotherArchivist
bf895ea5b1 Minor README cleanup 2020-10-17 21:21:20 +00:00
JustAnotherArchivist
e956e2562b Replace pkg_resources with importlib.metadata 2020-10-17 21:16:45 +00:00
JustAnotherArchivist
defe874bf4 Fix date extraction on VK
Only the most recent posts have the nice timestamp property...
2020-10-17 02:22:15 +00:00
JustAnotherArchivist
3f8935ee4d Fix crash on video reposts 2020-10-17 02:20:40 +00:00
JustAnotherArchivist
cd12500dbf Fix date extraction on quoted posts 2020-10-17 02:13:27 +00:00
JustAnotherArchivist
5dc61d50ac Add support for outlinks, photos, videos, and quoted posts on VK 2020-10-17 00:07:26 +00:00
JustAnotherArchivist
11a82e110a Remove obsolete comment
Cf. f296f9d2
2020-10-16 18:37:51 +00:00
JustAnotherArchivist
16ebe8bf48 Introduce dedicated IntWithGranularity type and deprecate the direct *Granularity fields 2020-10-16 18:20:47 +00:00
JustAnotherArchivist
1bbe25647a Refactor deprecated properties 2020-10-16 18:11:52 +00:00
JustAnotherArchivist
e22b461563 Add Python 3.9 classifier 2020-10-16 01:27:17 +00:00
JustAnotherArchivist
c4a5715e18 Fix Facebook user and community scrapers
Facebook is redirecting the previous user agent to the mobile site; use current Firefox ESR instead.
2020-10-16 01:20:50 +00:00
JustAnotherArchivist
5cb64faa72 Formally deprecate the already deprecated item attributes 2020-10-16 00:55:55 +00:00
JustAnotherArchivist
0f78aa45fc Refactor --format handling to avoid conversion to dict 2020-10-16 00:55:14 +00:00
JustAnotherArchivist
179112a310 Fix --format
Broken by the switch to dataclasses in bd53e729
2020-10-16 00:27:13 +00:00
JustAnotherArchivist
4ce9ed4eb3 Add --progress option that prints a status update every 100 results and at the end
Closes #116
2020-10-16 00:00:43 +00:00
JustAnotherArchivist
11414cb68f Rename cli module to make it clear that it is considered private API 2020-10-15 23:47:07 +00:00
JustAnotherArchivist
bd53e729a0 Replace named tuples with dataclasses and move JSON conversion logic to the base classes
Named tuples were never really adequate for this since the order aspect of them doesn't make sense.
Further, named tuples don't support multiple inheritance. This meant that the objects returned by get_items() were not actually Items, for example. Since Python 3.9, such named tuples cannot be created anymore.

Fixes #111
2020-10-15 23:44:28 +00:00
JustAnotherArchivist
ffd9289edc Reduce the logging level of retryable retrieval errors from WARNING to INFO
There is no real need to report these as WARNINGs as snscrape tries and in most cases manages to recover. Without --verbose, snscrape's output can be confusing (see #76). If the retries fail as well, snscrape will still log that as an ERROR and crash loudly.
2020-10-11 22:29:27 +00:00
JustAnotherArchivist
b1a7b9607f Skip individual Telegram photo/video links 2020-10-07 01:27:26 +00:00
JustAnotherArchivist
119e53d07c Fix Telegram post URL extraction 2020-10-07 01:15:51 +00:00
JustAnotherArchivist
c3e2e12369 Deprecate outlinksss 2020-10-01 22:00:26 +00:00
JustAnotherArchivist
a70b361176 Use more assignment expressions where appropriate 2020-10-01 21:45:25 +00:00
JustAnotherArchivist
8b68f1a8af Fix link previews for pure-image previews
... and any other preview that doesn't have all the things for some reason.
2020-10-01 18:56:55 +00:00
JustAnotherArchivist
c72bf3174f Use assignment expressions for cleaner code 2020-10-01 18:54:57 +00:00
JustAnotherArchivist
472cef2382 Add support for link previews 2020-10-01 18:51:14 +00:00
JustAnotherArchivist
b1d8475a03 Fix link extraction on Telegram 2020-10-01 18:29:08 +00:00
JustAnotherArchivist
3d3faf80bf Add python_requires to make it even clearer that 3.8+ is required 2020-09-26 16:32:00 +00:00
JustAnotherArchivist
bbb372284b Bump Python version in README 2020-09-26 15:56:55 +00:00
JustAnotherArchivist
8cf81e9bfc Fix twitter-profile scraper
The Twitter API returns different data structures there, leading to a variety of errors.
2020-09-25 02:45:07 +00:00
JustAnotherArchivist
d90f06b389 Extract more information on users from Twitter
Closes #78
2020-09-24 18:39:32 +00:00
JustAnotherArchivist
c519832755 Clarify twitter-list-posts argument value 2020-09-24 18:37:37 +00:00
JustAnotherArchivist
397a0b988e Remove Twitter list member scraper
It has been broken for a while. Member lists were removed from the old design, and they're behind a login wall on the new design.
2020-09-24 18:34:15 +00:00
JustAnotherArchivist
f1428fa0e0 Fix crash on nested quoted tweets 2020-09-24 02:45:49 +00:00
JustAnotherArchivist
7d2c546ee5 Deprecate hacky fields in Tweet objects 2020-09-24 02:00:45 +00:00
JustAnotherArchivist
2332c30e26 Replace locale-dependent strptime date parsing with email.utils.parsedate_to_datetime 2020-09-24 02:00:21 +00:00
JustAnotherArchivist
b78bf3e642 Fix crash on banner-less profiles and nested descriptionUrls 2020-09-24 01:58:38 +00:00
JustAnotherArchivist
1a09f9b9a3 Extract more information from Twitter
Including: reply/retweet/like/quote counts, media (photos, videos, and GIFs), full user object, quoted tweets, mentioned users, rendered content, conversation ID, language, source
2020-09-24 01:45:08 +00:00
JustAnotherArchivist
5ae5ec7bcd Bump Python version classifier
Python 3.8 is required since commit 1a2e367a.
2020-09-23 22:25:38 +00:00
JustAnotherArchivist
c0ff6631aa Update README 2020-09-22 22:30:08 +00:00
JustAnotherArchivist
ae60a4d0fd Add Weibo scraper
Closes #52
2020-09-13 02:27:35 +00:00
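
Note on 2332c30e26 (replace locale-dependent strptime date parsing with email.utils.parsedate_to_datetime): the following is a minimal sketch of why such a change helps, not the code from that commit. The sample timestamp and the variable names are assumptions made for illustration; the log does not show the actual input format or call site.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

# Hypothetical input: an RFC 2822 style timestamp with an explicit UTC offset.
raw = "Thu, 24 Sep 2020 02:00:21 +0000"

# Locale-dependent approach (shown for contrast only): %a and %b match the
# day and month names of the current LC_TIME locale, so this call can raise
# ValueError when the process runs under a non-English locale.
# parsed = datetime.strptime(raw, "%a, %d %b %Y %H:%M:%S %z")

# Locale-independent approach: the email.utils parser matches fixed English
# day and month names regardless of the process locale.
parsed = parsedate_to_datetime(raw)

# The result is timezone-aware because the input carries a +0000 offset.
print(parsed.isoformat())
```

Because the result is timezone-aware, it can be compared directly against other aware datetimes, for example datetime(2020, 9, 24, 2, 0, 21, tzinfo=timezone.utc).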