is it possible to output regular files instead of warc? #228
Comments
May be of interest to you: https://replayweb.page/ can load WARCs and lets you browse them. It works best on websites that don't rely heavily on JavaScript. I'd suggest using wpull on its own (grab-site is basically wpull but tuned for easier crawling), but the current state of wpull outside of wrappers like this is awful. :/
thanks. i'm familiar with replayweb, but warc is really not for me. i want the option to be able to do things like:
it's just easier for me to work with files. tbh, i would just use i've tried:
neither works. any tips? it's very frustrating because i've had luck using
but curl has no crawling functionality.
Does grab-site work with the cookie issue? Go into the exported_from_firefox.txt file and check for any #HttpOnly lines. Those are a common problem with cookies.txt parsers, as they aren't part of any official specification. I've had luck occasionally with removing the
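One way to test the cookie tip above, as a minimal sketch: it assumes the problem lines start with the `#HttpOnly_` prefix that Firefox exports and curl write in front of the domain field, and that the parser in use treats them as comments and skips them. The function names `strip_httponly_prefix` and `clean_cookie_file` are made up for illustration, and whether this fixes the crawl in question is untested.

```python
# Assumption: HttpOnly cookies are marked by this prefix at the start of the line.
HTTPONLY_PREFIX = "#HttpOnly_"


def strip_httponly_prefix(cookies_text: str) -> str:
    """Return cookies.txt content with the '#HttpOnly_' prefix removed.

    Real comment lines (e.g. '# Netscape HTTP Cookie File') are left alone.
    """
    cleaned = []
    for line in cookies_text.splitlines():
        if line.startswith(HTTPONLY_PREFIX):
            # Keep the cookie, drop only the marker so the line parses normally.
            line = line[len(HTTPONLY_PREFIX):]
        cleaned.append(line)
    return "\n".join(cleaned) + "\n"


def clean_cookie_file(src: str, dst: str) -> None:
    """Write a cleaned copy of the cookie file at src to dst (hypothetical helper)."""
    with open(src, "r", encoding="utf-8") as f:
        text = f.read()
    with open(dst, "w", encoding="utf-8") as f:
        f.write(strip_httponly_prefix(text))
```

For example, `clean_cookie_file("exported_from_firefox.txt", "cookies_clean.txt")` leaves the original export untouched, so the two files can be compared if the crawl still fails.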
i was so super frustrated with trying to extract files from old WARCs from another project that i didn't even bother trying
i'll look into your cookie tips and see if i can get
at first glance, i think your
@TheTechRobo Seconding the request for plain HTML files, but for a different reason: plugging them into AI document parsers like Khoj or GPT4All. Summarizing blogs and making personal assistants out of them is kinda lit.
i only want files, not warc.
can grab-site output regular files (like html and images) for me like wget can? (links must be converted to relative links)
side question: has anyone here actually had good results with getting files back out of warc? this wouldn't be such a big deal if that were possible. i've never seen a util that can extract files from warcs with a 100% success rate (and it's usually insanely slow).
i've tried:
extracted.001
, and idk how to get past that
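On the side question about getting files back out of a WARC, here is a rough sketch of one approach. It assumes the third-party `warcio` package is installed. The names `url_to_path` and `extract_warc` and the `extracted` output directory are made up for illustration. It does not rewrite links to relative ones, it ignores query strings (so URLs that differ only in their query overwrite each other), and it makes no claim to a 100% success rate.

```python
import os
from urllib.parse import urlsplit, unquote


def url_to_path(url: str, root: str = "extracted") -> str:
    """Map a URL to a local file path under root (hypothetical layout)."""
    parts = urlsplit(url)
    path = unquote(parts.path)
    if not path or path.endswith("/"):
        # Directory-style URLs get an index.html, similar to what wget does.
        path += "index.html"
    # Drop empty, "." and ".." segments so a path cannot escape root.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return os.path.join(root, parts.netloc, *segments)


def extract_warc(warc_path: str, root: str = "extracted") -> int:
    """Write the body of every HTTP response record to disk; return the count."""
    # Imported here so url_to_path stays usable without warcio installed.
    from warcio.archiveiterator import ArchiveIterator

    written = 0
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            if not url or not url.startswith(("http://", "https://")):
                continue
            target = url_to_path(url, root)
            try:
                os.makedirs(os.path.dirname(target), exist_ok=True)
                with open(target, "wb") as out:
                    # content_stream() yields the body with HTTP headers removed.
                    out.write(record.content_stream().read())
            except OSError:
                # e.g. /a was saved as a file and /a/b now needs a directory.
                continue
            written += 1
    return written
```

Calling `extract_warc("crawl.warc.gz")` (a hypothetical file name) would fill `extracted/<host>/...` with whatever response bodies could be written. Records that collide on disk are skipped rather than resolved.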