This reverts commit 3f513f1056.
Both live.staticflickr and farmN.staticflickr servers now produce the
same image file with a lower overall quality than before this change
on Flickr's end.
Flickr started serving images from live.staticflickr.com (see ec88ff1),
but the old farmN.staticflickr.com URLs still work - at least for the
time being.
Filesize (and most likely quality as well) for images from live.… is
severely reduced compared to images from farmN.… for non-original files,
so all live URLs are replaced to point to a randomly chosen farm server.
- 'post_id' and 'image_id' are only unique per user
- /image/ pages only show a maximum of 24 images, but there can be more
images than that in a blog post
- let extraction run in its own thread and maybe improve speed
- #190
This commit adds support for the two new JS expressions embedded in the
overall challenge code.
It does compute the correct 'js_answer' value, but the HTTP request to
/cdn-cgi/l/chk_jschl to get the 'cf_clearance' cookie always results in
a 403 response with a CAPTCHA inside (hence 'wip')
All steps to make this HTTP request indistinguishable from a regular web
browser (which passes the test) have no effect. This includes:
- using the exact same HTTP headers as a web browser
- following the same query argument order
- using different wait times
Images are now randomly served from the 'live.staticflickr.com' domain
instead of the "old" 'farmN.staticflickr.com' one, making it impossible
to use static 'url' and 'keyword' hashes as results.
Image quality doesn't appear to be affected by which image-server is
used. Files from 'farmN' and 'live' are the same.
removes basically all metadata, but that can be compensated for with the
right search query. Writing "parsers" for all 4 possible views that have
been introduced in the latest changes is too much of a hassle ...
Add support for hashtags (TagPage-s), i.e. explore/tags/<tag> URLs.
This also introduces a get_metadata() method in order to append
possible further metadata per-(sub)extractor.
Refactor and generalize _extract_profilepage() into _extract_page()
so it can be reused by both _extract_profilepage() and _extract_tagpage()
simply by passing the type of page (`ProfilePage' or `TagPage') and picking up
the respective fields in shared data.
* [instagram] Add support for GraphSidecar media types
Refactor _extract_postpage() to always return a list of media.
Fetch common keywords and gracefully handle the GraphSidecar media type
by extracting each individual media item and adding `sidecar_media_id'
and `sidecar_shortcode' keywords to indicate the parent of sidecar
children.
While here join the copyright comment lines in a single one.
Closes #178.
* [instagram] Use `yield from' instead of `for ... yield' (thanks @mikf)!
* [instagram] Adjust filenames for GraphSidecar media
Prepend the sidecar's `media_id', where available, to the filename of
GraphSidecar media.
Thanks to @mikf for the suggestion!
* [instagram] Add extra metadata for youtube-dl in GraphSidecar children
The ytdl: URLs of GraphSidecar children redirect to the URL of their
parent when consumed by youtube-dl. In GraphSidecar-s with multiple
GraphVideo-s this leads to downloading the same video multiple times.
Add a `_ytdl_index' field to indicate the index in the youtube-dl
playlist that corresponds to the sidecar child.
This will be used by the `ytdl' downloader.
- use original image if available
- support video formats
- remove user info for ImageExtractor (it is no longer possible to get
image owner information for a single image)
A URL alone isn't good enough to distinguish between a gallery or a
gallery-listing, so the new extractor decides what to do based on the
page's content.
- Sometimes an ad interfered when trying to get a download URL
- Resolving "www.hentai-foundry.com" yields an invalid(?) IPv6 address
(2607:5300:60:ca9e:feed:dead:beef:1) and urllib3 only tries to connect
to the IPv4 variant after a rather long wait time
Instead of getting a complete 'filename' from a URL and splitting that
into 'name' and 'extension', the new approach gets rid of the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway)
Example: "https://example.org/path/filename.ext"
before:
- filename : filename.ext
- name : filename
- extension: ext
now:
- filename : filename
- extension: ext
This allows for stuff like "{extractor.url}" and "{extractor.category}"
in logging format strings.
Accessing 'extractor' and 'job' in any way will return "None" if those
fields aren't defined, i.e. in general logging messages.
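For instance (a hypothetical format string; 'extractor' resolves to
"None" in messages without an extractor):
"output": {
    "log": "[{name}][{levelname}][{extractor.category}] {message}"
}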
Child extractors are now directly constructed with Extractor.from_url()
if the extractor class is known beforehand, instead of using
extractor.find() and searching through all possible extractor classes.
Instead of a strict list of (URL, RESULTS)-tuples, extractor result
tests can now be a single (URL, RESULTS)-tuple, if it's just one test,
and "only matching" tests can now be a simple string.
HTML structure for gallery pages changed quite a bit, so it is now using
the embedded JSON data. This changes a lot of metadata field names, but
'gallery_id', 'title', and 'user' are still provided for backwards
compatibility.
The internal API endpoint for user galleries also changed its data
structure, but nothing too major.
- allow instances to specify their own 'category'
- improve config lookup:
- first look into extractor.<category>.*
- and afterwards look into extractor.mastodon.<instance>.*
- add a default entry for pawoo.net in a way that actually works
- add an 'instance' keyword and turn 'tags' into a usable list
The former implementation would produce a complete list of all subalbums
for each (sub)album extraction. This would for example result in a
level 2 subalbum getting "extracted" twice: once through the root-album
(level 0) and once through its parent album on level 1.
In the current implementation only the next level of subalbums is
returned, and these will handle their own next level in a recursive
fashion.
Extractors for Mastodon instances can now be dynamically generated,
based on the instance names in the 'extractor.mastodon.*' config path.
Example:
{
    "extractor": {
        "mastodon": {
            "pawoo.net": { ... },
            "mastodon.xyz": { ... },
            "tabletop.social": { ... },
            ...
        }
    }
}
Each entry requires an 'access-token' value, which can be generated with
'gallery-dl oauth:mastodon:<instance URL>'.
An 'access-token' (as well as a 'client-id' and 'client-secret') for
pawoo.net is always available, but can be overwritten as necessary.
Using the same base-dict for each asset of a project causes unwanted
side effects like re-using image filename extensions for videos,
resulting in errors with the youtube-dl downloader.
... via HTTP Basic Auth with username and "password".
The password value in this case is not the account password itself,
but the"api_key" found in your user profile.
Hidden / dashboard-only blogs are pretty straightforward and "only"
require a valid 'access-token' and 'access-token-secret' for the given
'api-key' and 'api-secret', so that signed OAuth1.0 requests are possible.
Private / password protected blogs on the other hand are a bit
cumbersome. In addition to a valid 'access-token' and
'access-token-secret', they also require the account belonging to those
tokens to be a member of the blog itself. Knowing the password and
entering it in the website isn't enough to access a blog through the
API. Following a private blog is also impossible, so that option can't
work either.
* [instagram] Add extractor for instagram.com user profiles and pages
The extractor scrapes `instagram.com/<user>' timelines and
`instagram.com/p/<shortcode>' by mimicking the behaviour of a web
browser and extracting the sharedData JSON of the single pages.
Please note that this means that for user timelines we also do an
extra request to the `instagram.com/p/<shortcode>' page, but this
permits having consistent (and complete) information about the
fetched media.
The MD5 logic used for X-Instagram-GIS was documented in
<https://stackoverflow.com/questions/49786980/>
* [instagram] Test for keywords, not url for GraphImage and GraphSidecar
URLs returned by Instagram do not seem to be stable, so avoid testing
for them and test for the returned keywords instead.
* [instagram] Improve test of InstagramProfilepageExtractor
Also check the count of media returned.
* [instagram] Several cleanups and improvements
- Change description, subcategories to generate a better description in
docs/supportedsite.rst
- Remove the unneeded InstagramExtractor.__init__()
- Use text.parse_int() instead of directly using int() (the former is more
robust)
- Use self.request().json() instead of calling json.loads() on
  self.request().text()
- Add `pattern:' checks for URLs that are not stable.
  It seems that only the subdomain is unstable.
Thanks to @mikf!
While a filename might not be a real 'hash', or comparable to what
tumblr usually provides, it is still better than an empty string.
At least as long as "alternatives" in format strings aren't implemented.
The "default" downloader options (rate, retries, timeout, verify) are
mapped to corresponding youtube-dl options.
downloader.ytdl.logging tells the downloader to pass youtube-dl's output
to a Logger object.
downloader.ytdl.raw-options allows passing arbitrary options to the
YoutubeDL constructor.
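A possible configuration sketch; the keys inside 'raw-options' are
regular YoutubeDL options and only serve as examples:
"downloader": {
    "ytdl": {
        "logging": true,
        "raw-options": {
            "quiet": true,
            "merge_output_format": "mkv"
        }
    }
}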
from: 5xx HTTP Error: Reason
to : 5xx: Reason
The "HTTP Error" part was in there to emulate Request's error messages
from response.raise_for_status(), but it reads a lot better without.
In addition to 'abort' and 'exit', it is now possible to specify
'abort:N' and 'exit:N' (where N is any integer) as value for 'skip'
to abort/exit after consecutively skipping N downloads.
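For example, to stop after 3 consecutively skipped downloads (a minimal
sketch with the option applied to all extractors):
{
    "extractor": {
        "skip": "abort:3"
    }
}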
The first login will still use username and password, but everything
afterwards will use the refresh_token obtained from that.
This will prevent pixiv from sending a "New login to pixiv" email every
time a new access_token is requested.
The functionality of --(chapter-)filter and --(chapter-)range are now
also exposed as the following config-file options:
- extractor.*.image-filter
- extractor.*.image-range
- extractor.*.chapter-filter
- extractor.*.chapter-range
TODO: update configuration.rst
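A hypothetical example, assuming chapters that provide a 'lang'
metadata field:
{
    "extractor": {
        "chapter-filter": "lang == 'en'",
        "image-range": "1-20"
    }
}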
This change introduces 'extractor.*.retries/timeout/verify' options
as a general way to set these values for all HTTP requests.
'downloader.http.retries/timeout/verify' is a way to override these
options for file downloads only and will fall back to 'extractor.*.…'
values if they haven't been explicitly set.
Also: downloader classes now take an extractor object as first argument
instead of a requests.session.
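A sketch of how the two levels could be combined:
{
    "extractor": {
        "retries": 5,
        "timeout": 30.0,
        "verify": true
    },
    "downloader": {
        "http": {
            "retries": 10,
            "timeout": 60.0
        }
    }
}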
URLs starting with 'ytdl:' will now be handled by youtube-dl.
There is probably a lot to fix and improve, but the basic use case
works.
TODO:
- format selection and ytdl options in general
- better filename/path handling
- ytdl support for "unsupported URLs"
- ...
Enabling this option will detect videos in tweets and output them as
"unsupported" URLs, so that they can then be downloaded with youtube-dl.
There are a lot of improvements to be made to the current
implementation, but it works and does what it is supposed to, even if
it is about as inefficient as can be ...
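Assuming the option is exposed as extractor.twitter.videos (an assumed
name, not spelled out here), enabling it could look like this:
{
    "extractor": {
        "twitter": {
            "videos": true
        }
    }
}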
and some smaller changes ...
'user' is the name of the account an image is listed at and
'artist' is now the name of the account who created the image.
For example "https://www.hentai-foundry.com/user/Tenpura/faves/pictures"
- 'user': Tenpura
- 'artist' of the only image: LewdBrush
- rename "deleted" to "same-blog"
- change test for deleted original post to test if
original post owner has the same UUID (full blog name) as the one
being downloaded from
- add 'blog[uuid]' metadata to allow comparison with
'reblogged_from_uuid'
Setting 'reblogs' to "deleted" will check if the parent post of a
reblog has been deleted and download its media content if that is the
case, otherwise it will be skipped.
This is a rather costly operation (1 API request per reblogged post)
and should therefore be used with care.
Each post-processor config dict now supports a list of extractor
categories for which it should or shouldn't be active.
For example:
"postprocessors": [
{"name": "classify",
"whitelist": ["tumblr", "deviantart"],
...
}
]
A format string now gets parsed only once instead of re-parsing it each
time it is applied to a set of data.
The initial parsing causes directory path creation to be about 2x
slower than before, since each format string there is used only once,
but building a filename, the more common operation, is at least 2x
faster. The "directory slowness" cancels out at about 5 filenames and
everything above that is significantly faster.
http://subapics.com/ got discontinued and replaced by http://ngomik.in/.
ngomik.in is still displaying a link to the "old site" showing a big
"Account Suspended" sign.
For example "https://twitter.com/PicturesEarth/media".
They are different from normal timelines in that they do not contain
any (re)tweets from other users and feature all media the user ever
posted, including responses to other tweets.
- rename User- to TimelineExtractor
- rename 'userid' to 'user_id' to conform to the other ..._id values
- adjust archive_fmt to deal with retweets
- emulate browser behavior for API calls
- cache manga API results
- add artist, author and date fields to chapter metadata
- remove Manga-/ChapterExtractor inheritance
- minor code simplifications and improvements
The L option allows for the contents of a format field to be replaced
with <replacement> if its length is greater than <maxlen>.
Example:
{f:L5/too long/} -> "foo"      (if "f" is "foo")
                 -> "too long" (if "f" is "foobar")
(#92) (#94)
All API requests now always use a public token and only switch to
a private token for pagination results if `refresh-token` is set
and fewer deviations than requested were returned.
Always trying with a public token first and repeating the API request
with a private token if deviations are missing doesn't quite work for
galleries and folders with fewer than 25 items, so it's an option and
not the default.
Instead of using a refresh-token-based access-token for every API
request, they are now only used for paginated results.
API requests to get a user's profile and the original download URL
now always use a public access-token.
By default FFmpeg assumes a 25 FPS input frame rate, leading to dropped
frames if the source requires a higher frame rate than that.
This commit adds a `framerate` option (default "auto"), which allows
automatically assigning a (more or less) fitting frame rate based on
the delays between ugoira frames and avoids dropped frames.
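A minimal sketch, assuming the converter is registered as the 'ugoira'
post-processor:
"postprocessors": [
    {
        "name": "ugoira",
        "framerate": "auto"
    }
]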
It is now possible to slice string (or list) values of format string
replacement fields with the same syntax as in regular Python code.
"{digits}" -> "0123456789"
"{digits[2:-2]}" -> "234567"
"{digits[:5]}" -> "01234"
The optional third parameter (step) has been left out to simplify things.
DeviantArt changed its URL format from
https://<name>.deviantart.com/...
to
https://www.deviantart.com/<name>/...
With this change both formats will be supported.
- ffmpeg-location: path to the ffmpeg (or avconv) executable
- ffmpeg-args: additional command line args for ffmpeg
- extension: filename extension of the resulting video file
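Put together, a possible post-processor entry using these options
(paths and ffmpeg arguments are only examples):
"postprocessors": [
    {
        "name": "ugoira",
        "extension": "webm",
        "ffmpeg-location": "/usr/bin/ffmpeg",
        "ffmpeg-args": ["-c:v", "libvpx", "-crf", "4", "-b:v", "5000k"]
    }
]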
Useful for quick testing (even though -g and -j kind of do the same)
and to fill a download archive without actually downloading the files.
-s does the same as the default behaviour, except for actually downloading files.
Maybe it should get a more fitting name, as it does actually write to
disk (cache, archive)?
- combine 'favorite' and 'bookmark' extractors
- it is now one extractor class, but its subcategory still
distinguishes between your own bookmarks ('bookmark') and other
user's bookmarks ('favorite') like before
- allow filtering by bookmark tags and public/private bookmarks
- fix pagination for bookmark results
The API endpoint responsible for user illustrations does not
provide sufficient filter capabilities* to match the actual
website, so we are spinning our own filters.
Respected parameters are
'type': illust, manga, ugoira
'tag' : any image tag (this was already supported)
'p' : the page to start on
*
- API can filter for illustrations and manga, but not for ugoira.
- 'offset' is applied before filtering
- no 'tag' filter
Transitioning to the App API breaks favorites archive IDs (there is
no longer any bookmark ID information), but the favorites API endpoint
of the public API was gone anyways ...
OAuth support for SmugMug needs some additional features
(auth-rebuild on redirect, query parameters in URL, ...)
and fixing this in the old code wouldn't work all that well.
Standard logging to stderr, logfiles, and unsupported URL files (which
are now handled through the logging module) can now be configured by
setting their respective option keys (log, logfile, unsupportedfile)
to a dict and specifying the following options:
- format:
format string for logging messages
available keys: see [1]
default: "[{name}][{levelname}] {message}"
- format-date:
format string for {asctime} fields in logging messages
available keys: see [2]
default: "%Y-%m-%d %H:%M:%S"
- level:
the lowercase name of the minimum level a message must have to be logged;
available levels are debug, info, warning, error, exception
default: "info"
- path:
path of the file to be written to
- mode:
'mode' argument when opening the specified file
can be either "w" to truncate the file or "a" to append to it (see [3])
If 'output.log', '.logfile', or '.unsupportedfile' is a string, it
will, as before, be interpreted as the filepath (or as a format string
in the case of .log)
[1] https://docs.python.org/3/library/logging.html#logrecord-attributes
[2] https://docs.python.org/3/library/time.html#time.strftime
[3] https://docs.python.org/3/library/functions.html#open
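A possible 'output.logfile' configuration using these options:
"output": {
    "logfile": {
        "path": "~/gallery-dl/log.txt",
        "mode": "w",
        "format": "{asctime} {name}: {message}",
        "format-date": "%H:%M:%S",
        "level": "debug"
    }
}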
just some initial code that still requires a lot of work ...
TODO:
- folders
- old-style albums (which are nearly all of them ...)
- images from users
- OAuth
It could also happen that the API credentials used here become invalid
once my 14-day trial period ends (7 days remaining), but that
would just require users to supply their own.
The previous implementation would retry requests with 4xx status codes
in an infinite loop, which is especially a problem when querying
non-existent users or groups. These are now properly handled with a
NotFoundError exception.
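A standalone sketch of the intended behavior (names and structure are
illustrative, not the actual extractor code):
import requests

class NotFoundError(Exception):
    """Raised when a queried user or group does not exist."""

def fetch(session, url):
    # 4xx responses indicate a client error and will not succeed on a
    # retry, so fail immediately instead of looping forever
    response = session.get(url)
    if 400 <= response.status_code < 500:
        raise NotFoundError(url)
    response.raise_for_status()
    return response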
Pinterest access tokens are rate limited at 200 requests per
hour (or maybe per 2 or 3 hours?) so having just one access token
for all users isn't going to work in the long run.
- another irrelevant micro-optimization !
- use urllib.parse.parse_qsl directly instead of parse_qs, which
just packs the results of parse_qsl in a different data structure
- reduced memory requirements since no additional dicts and lists are
created
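For reference, the difference between the two (plain standard-library
behavior):
from urllib.parse import parse_qs, parse_qsl

query = "tag=landscape&page=2"
print(parse_qs(query))   # {'tag': ['landscape'], 'page': ['2']}
print(parse_qsl(query))  # [('tag', 'landscape'), ('page', '2')]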
calling 'abort()' in a filter aborts the current extractor run
in a cleaner way than using something like 1/0, which
causes an error message to be printed
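A hypothetical command-line usage, assuming the metadata contains a
'num' field:
gallery-dl --filter "num <= 10 or abort()" "https://example.org/gallery/123"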
https://img.yt/ wasn't available for a couple of days, but has now
re-emerged as https://imx.to/ with a new web-interface.
Links to older images still work (see tests).
'hash' is the middle part of the filename in a tumblr image URL.
For example an image with '.../tumblr_p6tgemp1NZ1wgha4yo1_250.png' as
its URL would have 'p6tgemp1NZ1wgha4yo1' as hash.
- fix the cloudflare challenge result if the last decimal places
are zero (JS's toFixed() removes trailing zeroes)
- fix downloading of kissmanga chapter-pages hosted on blogspot
(accessing blogspot with "kissmanga.com" as referrer yields a 401)
- disable certificate validation for 'mangahere' tests
- update flickr test result
Cloudflare challenges, at least for kissmanga and readcomiconline,
now use slightly different Javascript expressions.
Instead of a single value per expression, they now have a numerator
and a denominator of a fractional value, which in the end gets
truncated to 10 decimal places.
- safeprint() was used to print values which might have caused a
UnicodeEncodeError, but that is no longer necessary (0381ae5)
- errors are now handled via logging output (f94e370)
Python 3.5 and lower throw a UnicodeEncodeError when trying to print
non-encodable characters when not using 'utf-8' as encoding.
Setting their error handlers to 'replace' should help.
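A rough sketch of what setting the error handler could look like (not
necessarily the exact code used):
import sys

# recreate stdout with a 'replace' error handler so printing
# non-encodable characters no longer raises UnicodeEncodeError
sys.stdout = open(
    sys.stdout.fileno(), mode="w",
    encoding=sys.stdout.encoding, errors="replace",
    buffering=1, closefd=False,
)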
- add 'title' and 'description'
- split 'artist_id' into 'user_id' and 'artist_id'
- 'user_id' is the ID of the user from which the image entry
originates from
- 'artist_id' is the ID of the actual image artist
- improve pagination and URL patterns
- use '?a.hitomi.la' as subdomain depending on gallery-id
- add 'characters', 'tags' and 'date' information
- support multiple entries per metadata-value
- rename 'num' to 'page'
{folder[index]} and {collection[index]} are both '0' when being
delegated from Gallery- or FavoriteExtractors, as there is no
way of knowing a folder's index when getting folder-information
from the API.
... to behave in a more straightforward way when dealing with
bookmarks/favourites/etc.
Specific IDs are now grouped by their owner, album-id, ... to
allow for duplicates where they would be expected.
- simplify regex
- unquote search tags
- increase default wait-time between HTTP requests
- downloading several hundreds of images always resulted
in '429 Too Many Requests' eventually
- circumvent paging restrictions for unauthenticated users by only
using the 'next' parameter
- setting 'page' to a constant, low value (or simply omitting it)
does the trick
Missing or undefined keywords will now be replaced with the value
set for 'keywords-default'. The default is Python's 'None', which
is equivalent to setting this option to JSON's 'null'.
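For example, to substitute missing values with a fixed string:
{
    "extractor": {
        "keywords-default": "unknown"
    }
}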
Instead of a dictionary/object, input file options are now specified
by a 'key=value' pair starting with '-' for options only applying to
the next URL or '-G' for Global options applying to all following URLs.
See the docstring of parse_inputfile() for details.
Example option specifiers:
- filename = "{id}.{extension}"
- extractor.pixiv.user.directory = ["Pixiv Users", "{user[id]}"]
-spaces="are_optional"
-G keywords = {"global": "option"}
- see docstring of parse_inputfile() for details
- TODO: unittests, recursion (currently setting for example
{"extractor": {"key": "value"}} will override the whole "extractor"
branch instead of merging {"key": "value"} into the already existing
dictionary)
- requests and urllib3 version on 1 line
- close input file after reading from it
- use expand_path for unsupported-urls file
- remove unnecessary logging from options.py
- "count" can now be a string defining a comparison in the form of
'<operator> <value>', for example: '> 12' or '!= 1'. If its value
is not a string, it is assumed to be a concrete integer as before.
- "keyword" can now be a dictionary defining tests for individual keys.
These tests can either be a type, a concrete value or a regex
starting with "re:". Dictionaries can be stacked inside each other.
Optional keys can be indicated with a "?" before its name.
For example:
"keyword:" {
"image_id": int,
"gallery_id", 123,
"name": "re:pattern",
"user": {
"id": 321,
},
"?optional": None,
}
This allows the DeviantArt group-check to be moved inside the
Extractor.items() method which in turn allows for better exception
handling.
As a new general rule:
Never raise exceptions during extractor initialization.
Gelbooru's API allows access to all images and is not restricted
to the first 20000.
This also adds an option to select between API use and manual
information extraction in case their API gets disabled again.
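Assuming the new option is exposed as extractor.gelbooru.api (an
assumed name, not stated here), disabling API use could look like:
{
    "extractor": {
        "gelbooru": {
            "api": false
        }
    }
}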