Mike Fährmann
bc0e853d30
combine KeyError & IndexError to common base class LookupError
3 years ago
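As a small illustration of the idea in this commit: `KeyError` and `IndexError` both derive from `LookupError`, so a single `except` clause covers dict and sequence lookups alike. The helper name `first_item` below is hypothetical and not taken from the module.

```python
def first_item(container, key, default=None):
    """Return container[key], or default if the lookup fails."""
    try:
        return container[key]
    except LookupError:  # base class of both KeyError and IndexError
        return default
```

The same function then works for `first_item({"a": 1}, "b")` and `first_item([1, 2], 5)` without listing both exception types.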
Mike Fährmann
bc868e7bb8
consider apparently long extensions as part of the filename (#1516)
3 years ago
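A minimal sketch of what such a rule could look like. The cutoff `MAX_EXT_LEN` and the helper name `split_name_ext` are assumptions made for illustration; the commit message does not state the actual threshold.

```python
MAX_EXT_LEN = 16  # assumed cutoff, not taken from the commit


def split_name_ext(filename):
    """Split 'name.ext', keeping overly long 'extensions' in the name."""
    name, _, ext = filename.rpartition(".")
    if name and len(ext) <= MAX_EXT_LEN:
        return name, ext
    # No dot, empty name, or a suspiciously long suffix:
    # treat the whole string as the filename.
    return filename, ""
```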
Mike Fährmann
387fe415d5
unescape items in text.split_html()
4 years ago
Mike Fährmann
78fd63b8f0
remove 'text.clean_xml()'
...
was not used anywhere
4 years ago
Mike Fährmann
8553b218d9
replace calls to 'os.path.splitext()' with 'str.rpartition()'
...
Makes functions that used it more than twice as fast
and we can get rid of an import as well.
4 years ago
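The replacement described above can be sketched as follows. Both helper names are hypothetical; the point is only that `str.rpartition()` needs no import and does less work than `os.path.splitext()`.

```python
import os.path


def ext_splitext(filename):
    # Standard-library approach: returns the extension without the dot.
    return os.path.splitext(filename)[1][1:]


def ext_rpartition(filename):
    # Plain string method: split once at the last '.'.
    name, dot, ext = filename.rpartition(".")
    return ext if dot else ""
```

The two are not equivalent in every edge case: for a leading-dot name such as `.bashrc`, `os.path.splitext()` reports no extension, while the `rpartition()` version returns `bashrc`.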
Mike Fährmann
a09f42f6b3
improve filename_from_url() performance
...
Manually extracting the part between the last '/' and '?' instead of
relying on the standard library's 'urllib.parse.urlsplit()' makes the
function roughly 4 times as fast.
urlsplit() : 3.64 secs per 1.000.000 iterations
partition(): 0.87 secs per 1.000.000 iterations
4 years ago
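A sketch of the two approaches compared in this commit, under the assumption that only the text between the last '/' and the first '?' matters. The function names are hypothetical, and no timings are reproduced here.

```python
from urllib.parse import urlsplit


def filename_urlsplit(url):
    # Parse the full URL, then take the last path segment.
    return urlsplit(url).path.rpartition("/")[2]


def filename_partition(url):
    # Cut off the query string, then take everything after the last '/'.
    return url.partition("?")[0].rpartition("/")[2]
```

The manual version trades generality for speed: it does not strip a `#fragment`, for example, which `urlsplit()` handles.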
Mike Fährmann
37d71f6e09
strip microseconds in text.parse_datetime()
4 years ago
Mike Fährmann
6294e2c540
add 'text.ensure_http_scheme()'
4 years ago
Mike Fährmann
a0f4c295c0
add optional 'utcoffset' argument to 'parse_datetime()'
5 years ago
Mike Fährmann
f6c5edb76b
pre-compile regex pattern for remove_html() and split_html()
5 years ago
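Pre-compiling means the pattern is built once at import time and reused on every call. A rough sketch, assuming a simple tag pattern (the module's real pattern is not shown in this log):

```python
import re

# Compiled once at module load; the pattern itself is an assumption.
_TAG = re.compile(r"<[^>]+>")


def remove_html(txt):
    """Replace tags with spaces and collapse whitespace."""
    return " ".join(_TAG.sub(" ", txt).split())


def split_html(txt):
    """Split text at tags and drop empty fragments."""
    return [part.strip() for part in _TAG.split(txt) if part.strip()]
```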
Mike Fährmann
b1bea8aaeb
add 'restrict-filenames' option (#348)
5 years ago
Mike Fährmann
1740086d8a
add 'repl' and 'sep' arguments to text.replace_html()
5 years ago
Mike Fährmann
b171befa87
implement 'parse_unicode_escapes()'
5 years ago
Mike Fährmann
2b1999476e
implement 'text.rextract()'
5 years ago
Mike Fährmann
2316e0ed3d
fix strptime workaround from b0e85a4
...
Don't return a modified version of 'date_time' if strptime fails.
5 years ago
Mike Fährmann
b0e85a42e3
apply workaround from 4736912 in parse_datetime() itself
5 years ago
Mike Fährmann
d09864b581
implement text.parse_datetime()
5 years ago
Mike Fährmann
6264a46212
use 'utcfromtimestamp()'
...
'fromtimestamp()' converts its results to the local timezone and causes
problems when running tests on a different machine.
5 years ago
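The timezone issue can be avoided by always converting in UTC. The sketch below uses `datetime.fromtimestamp()` with an explicit UTC timezone and then drops the tzinfo, which gives the same naive result as `utcfromtimestamp()` (the latter is deprecated as of Python 3.12). The `default` parameter is an assumption for illustration.

```python
from datetime import datetime, timezone


def parse_timestamp(ts, default=None):
    """Convert a Unix timestamp to a naive datetime in UTC."""
    try:
        # Passing timezone.utc keeps the result independent of
        # the machine's local timezone.
        aware = datetime.fromtimestamp(int(ts), timezone.utc)
        return aware.replace(tzinfo=None)
    except (TypeError, ValueError, OverflowError, OSError):
        return default
```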
Mike Fährmann
d670de0344
implement 'text.parse_timestamp()'
5 years ago
Mike Fährmann
21a7e395a7
implement convenience wrapper for text.extract functionality
6 years ago
Mike Fährmann
8f249f1d54
improve text.extract_iter() performance
...
by roughly 40% through
- inlining code
- pre-calculating reused values
- entering a try-except block only once
6 years ago
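The three points above can be sketched like this: marker lengths are computed once, the search is inlined in the loop, and a single `try` block wraps the whole iteration. This is an illustrative reconstruction, not necessarily the code from the commit.

```python
def extract_iter(txt, begin, end, pos=0):
    """Yield every substring of txt found between begin and end."""
    index = txt.index    # bound method looked up once
    lbeg = len(begin)    # pre-calculated, reused on every iteration
    lend = len(end)
    try:                 # entered once, not once per match
        while True:
            first = index(begin, pos) + lbeg
            last = index(end, first)
            pos = last + lend
            yield txt[first:last]
    except ValueError:
        # str.index() raises ValueError when no further match exists.
        return
```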
Mike Fährmann
5530871b5a
change results of text.nameext_from_url()
...
Instead of getting a complete 'filename' from a URL and splitting that
into 'name' and 'extension', the new approach gets rid of the complete
version and renames 'name' to 'filename'. (Using anything other than
{extension} for a filename extension doesn't really work anyway)
Example: "https://example.org/path/filename.ext"
before:
- filename : filename.ext
- name : filename
- extension: ext
now:
- filename : filename
- extension: ext
6 years ago
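A minimal sketch of the "now" behaviour from the example above. The optional `data` argument and the lowercasing of the extension are assumptions based on other entries in this log, not on this commit.

```python
def nameext_from_url(url, data=None):
    """Return a dict with 'filename' (no extension) and 'extension'."""
    if data is None:
        data = {}
    # Last path segment, without the query string.
    full = url.partition("?")[0].rpartition("/")[2]
    name, dot, ext = full.rpartition(".")
    if dot:
        data["filename"], data["extension"] = name, ext.lower()
    else:
        data["filename"], data["extension"] = full, ""
    return data
```

With the URL from the example, `nameext_from_url("https://example.org/path/filename.ext")` yields `filename` set to `filename` and `extension` set to `ext`, with no separate `name` key.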
Mike Fährmann
e1d3e9a926
add 'ext_from_url' to text.py
6 years ago
Mike Fährmann
2d2953a5bf
add 'text.parse_float()' + cleanup in text.py
6 years ago
Mike Fährmann
ae9a37a528
implement text.split_html()
6 years ago
Mike Fährmann
cc36f88586
rename safe_int to parse_int; move parse_* to text module
7 years ago
Mike Fährmann
4ffa94f634
remove 'shorten_path()' and 'shorten_filename()'
7 years ago
Mike Fährmann
27eab4e467
rewrite text tests and improve functions
...
- test more edge cases
- consistently return an empty string for invalid arguments
- remove the ungreedy-flag in 'remove_html()'
7 years ago
Mike Fährmann
e3f2bd4087
add tests for 'text.clean_xml()' and improve it
7 years ago
Mike Fährmann
6d8b191ea7
improve 'parse_query()' and add tests
...
- another irrelevant micro-optimization!
- use urllib.parse.parse_qsl directly instead of parse_qs, which
just packs the results of parse_qsl in a different data structure
- reduced memory requirements since no additional dict and lists are
created
7 years ago
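The change described above might look roughly like this: `parse_qsl()` returns a flat list of key-value pairs, which can be turned into a plain dict without the intermediate lists that `parse_qs()` builds. Which value wins for a repeated key is an assumption here (the last one, as a consequence of `dict()`), as is returning an empty dict for non-string input.

```python
from urllib.parse import parse_qsl


def parse_query(qs):
    """Parse a query string into a plain str-to-str dict."""
    if not isinstance(qs, str):
        return {}
    # parse_qsl yields (key, value) tuples; dict() keeps the last
    # value when a key occurs more than once.
    return dict(parse_qsl(qs))
```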
Mike Fährmann
731ffd4986
improve text.filename_from_url() performance
...
- urlsplit() is faster than urlparse()
- rpartition() is faster than rindex() + slicing
- new version is 2.3 times as fast
7 years ago
Mike Fährmann
f7cdfd4c25
add a simplified version of 'parse_qs'
...
This version only returns a dict of plain string to string key-value
pairs and ignores multiple values for the same query variable.
7 years ago
Mike Fährmann
e5f79ae839
[deviantart] add support for all media types
...
- this includes
- images
- videos
- flash-animations
- journals
- also renamed some of the extractors
- User -> Gallery
- Image -> Deviation
7 years ago
Mike Fährmann
ed94d9b92d
fix/improve various things
8 years ago
Mike Fährmann
619c74159a
[seiga] fix file extension and xml parsing
...
- The file extension of the first image had been used for all further
images
- API responses can contain invalid characters, which cause the XML
parser to fail (http://seiga.nicovideo.jp/user/illust/26377934
contains several \x08 characters)
8 years ago
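One way to work around such responses is to strip control characters that XML 1.0 does not allow before handing the text to the parser. The sketch below is an assumption about the general approach, not the code from this commit; `parse_xml_lenient` is a hypothetical name.

```python
import re
import xml.etree.ElementTree as ET

# Control characters that are not allowed in XML 1.0 documents
# (tab, newline and carriage return are allowed and kept).
_INVALID_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def parse_xml_lenient(text):
    """Parse XML after removing invalid control characters."""
    return ET.fromstring(_INVALID_XML.sub("", text))
```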
Mike Fährmann
4f123b8513
code adjustments according to pep8
8 years ago
Mike Fährmann
8780abcc77
fix a small spelling error
8 years ago
Mike Fährmann
00074a71d7
several changes to make travis build work
...
- fixed html.unescape not being available on Python 3.3
- removed inconsistent test result
- added username/password pairs for authenticating extractors
8 years ago
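A common pattern for the `html.unescape` item above is a guarded import. This is a sketch of that pattern and not necessarily what the commit did; the fallback branch only applies to old interpreters, since `HTMLParser.unescape` was removed in Python 3.9.

```python
try:
    # Available since Python 3.4.
    from html import unescape
except ImportError:
    # Fallback for older interpreters such as Python 3.3.
    from html.parser import HTMLParser
    unescape = HTMLParser().unescape
```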
Mike Fährmann
91c446805b
replace platform.system() with os.name
8 years ago
Mike Fährmann
8a49a28d13
replace deprecated 'unescape' method
9 years ago
Mike Fährmann
99b4fbb081
implement text.extract_iter
9 years ago
Mike Fährmann
7fd284a705
always provide lowercase file extensions
9 years ago
Mike Fährmann
ca523b9f64
add helper method to text module
9 years ago
Mike Fährmann
d0bebd9ce3
allow adding values to existing dict
9 years ago
Mike Fährmann
629133a27a
document text.extract
9 years ago
Mike Fährmann
692d0c95cc
reimplement text.extract_all
9 years ago
Mike Fährmann
db479f881d
implement text.shorten_path/filename methods
9 years ago
Mike Fährmann
89f938ee55
handle non-string-like arguments for clean_path
9 years ago
Mike Fährmann
c5801c9770
combine text related functions in new module
9 years ago