Download website text only

ChrisGreaves
PlutoniumLounger
Posts: 11305
Joined: 24 Jan 2010, 23:23
Location: paused.undefined.exposed

Post by ChrisGreaves »

I am looking to download ONLY the text of a specific web site (no MP3 files, no images), and I wonder what people are using in the year 2020.

I gave www.httrack.com a quick spin, but it seemed to whir away for an hour and then paused, restarting when I clicked somewhere.
I am looking for just a quick grab of the text. I will be analyzing it later, so I will have access to my own text-cleanup tools.
Thanks
Chris
People who live in glass houses shouldn’t grow zucchinis

curious
3StarLounger
Posts: 250
Joined: 25 Jan 2010, 17:36

Re: Download website text only

Post by curious »

I think you'll find Nuke Anything is what you're looking for:

https://download.cnet.com/Nuke-Anything ... 48044.html

jolas
2StarLounger
Posts: 126
Joined: 02 Feb 2010, 23:58

Re: Download website text only

Post by jolas »

If by any chance you are using the Chrome browser or the new Edge browser, there is a Chrome extension called Reader View that also works with the new Edge.

There is an option to print, save, change font size and even background shade, etc. There is a toggle option to hide/show images as well.

Firefox has a similar extension, but it is kind of limited in my opinion.

Hope this helps.

BobH
UraniumLounger
Posts: 7901
Joined: 13 Feb 2010, 01:27
Location: Temple - Deep in the Heart of Texas

Re: Download website text only

Post by BobH »

curious wrote:
16 Sep 2020, 23:01
I think you'll find Nuke Anything is what you're looking for:

https://download.cnet.com/Nuke-Anything ... 48044.html
I followed the link to CNET without problems, but the CNET page has no option to download the software. It does have a link to the software publisher's page, but that took me to a 404 page.
Regards, BobH
Story of my life: I knew better but did it anyway!
Intel Core i5, 3570K, 3.40 GHz, 16 GB RAM, ECS Z77 H2-A3 Mobo, Windows 7 >HPE 64-bit, MS Office 2016

HansV
Administrator
Posts: 69049
Joined: 16 Jan 2010, 00:14
Status: Microsoft MVP
Location: Wageningen, The Netherlands

Re: Download website text only

Post by HansV »

If you're using Chrome or Edge: Nuke Anything Enhanced
If you're using Firefox: Nuke Anything Enhanced

Warning: the extension hasn't been updated since 2017, so it might not be compatible with the current browser version.
Regards,
Hans

BobH
UraniumLounger
Posts: 7901
Joined: 13 Feb 2010, 01:27
Location: Temple - Deep in the Heart of Texas

Re: Download website text only

Post by BobH »

Thanks, Hans!
Regards, BobH

curious
3StarLounger
Posts: 250
Joined: 25 Jan 2010, 17:36

Re: Download website text only

Post by curious »

BobH -

Sorry you had that problem. Hans' suggestion is most likely an updated version, so I urge you to try that.

BobArch2
5StarLounger
Posts: 1095
Joined: 25 Jan 2010, 22:25
Location: Pickering, Ontario, Canada

Re: Download website text only

Post by BobArch2 »

HansV wrote:
17 Sep 2020, 18:58
If you're using Chrome or Edge: Nuke Anything Enhanced
If you're using Firefox: Nuke Anything Enhanced

Warning: the extension hasn't been updated since 2017, so it might not be compatible with the current browser version.
I have often thought that there must be some tool to filter out unwanted material. Thanks Hans ... and I did see your warning. :grin:
Regards,
Bob

ChrisGreaves
PlutoniumLounger
Posts: 11305
Joined: 24 Jan 2010, 23:23
Location: paused.undefined.exposed

Re: Download website text only

Post by ChrisGreaves »

jolas wrote:
17 Sep 2020, 02:43
There is an option to print, save, change font size and even background shade, etc. There is a toggle option to hide/show images as well
Hi jolas, and my apologies for the delay.
This is close to what I want, but I do not want to hide the non-text.
I want the site-downloader to filter the non-text automatically.

So far the applications I've tried either
(1) download the entire site, after which I must write a post-processor to filter out non-text, or
(2) provide switches which in my stumbling way I can't get to work.
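For option (1), the post-processor does not have to be elaborate. Below is a minimal sketch using only the Python standard library, assuming the downloader has left a mirror folder of ordinary .htm/.html files on disk. The names `extract_site_text`, `mirror_dir` and `out_dir` are made up for illustration and do not come from any of the tools mentioned in this thread.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextOnly(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text of one HTML document, one chunk per line."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def extract_site_text(mirror_dir, out_dir):
    """Write one .txt file per .htm/.html file under mirror_dir (hypothetical layout)."""
    mirror_dir, out_dir = Path(mirror_dir), Path(out_dir)
    written = []
    for page in sorted(mirror_dir.rglob("*")):
        if not page.is_file() or page.suffix.lower() not in {".htm", ".html"}:
            continue  # images, MP3s and everything else are simply ignored
        target = (out_dir / page.relative_to(mirror_dir)).with_suffix(".txt")
        target.parent.mkdir(parents=True, exist_ok=True)
        html = page.read_text(encoding="utf-8", errors="replace")
        target.write_text(html_to_text(html), encoding="utf-8")
        written.append(target)
    return written
```

Calling `extract_site_text("mirror", "text")` would leave a parallel tree of .txt files. This only filters after the fact, so the downloader still fetches the non-text files in the first place.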
Cheers
Chris

ChrisGreaves
PlutoniumLounger
Posts: 11305
Joined: 24 Jan 2010, 23:23
Location: paused.undefined.exposed

Re: Download website text only

Post by ChrisGreaves »

HansV wrote:
17 Sep 2020, 18:58
If you're using Firefox: Nuke Anything Enhanced
Thanks Hans.
I believe I expressed myself poorly in the original post.

I'm looking for an application that will download an entire web site (or at least everything at or below the URL) AND will strip out all non-text material as it goes.
Think "off-line analysis of user comments" as an example.
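To make the requirement concrete, here is a rough sketch of that behaviour in Python, standard library only. It is an illustration under assumptions, not a description of any existing application: `crawl_text`, `fetch_html`, `max_pages` and `delay` are hypothetical names, "at or below the URL" is taken to mean "same host, same folder or deeper", and a page counts as text only if the server labels it `text/html`.

```python
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit
from urllib.request import urlopen


class PageParser(HTMLParser):
    """Gather visible text and href links from one HTML page."""

    def __init__(self):
        super().__init__()
        self.text, self.links = [], []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text.append(data.strip())


def fetch_html(url):
    """Return the page body as str, or None if the response is not HTML."""
    with urlopen(url, timeout=30) as resp:
        if resp.headers.get_content_type() != "text/html":
            return None  # MP3s, images and other non-text are never read
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def crawl_text(start_url, fetch=fetch_html, max_pages=100, delay=1.0):
    """Breadth-first crawl of pages at or below start_url; returns {url: text}."""
    parts = urlsplit(start_url)
    folder = parts.path.rsplit("/", 1)[0] + "/"
    prefix = f"{parts.scheme}://{parts.netloc}{folder}"
    seen, queue, pages = {start_url}, deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        time.sleep(delay)  # pause between requests to go easy on the server
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        pages[url] = "\n".join(parser.text)
        for href in parser.links:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

The `max_pages` cap and the `delay` between requests are there so that a sketch like this cannot run away on a large site. Links that point to images or audio are still requested once, but their bodies are not read because the content-type check rejects them first.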

Cheers
Chris

jolas
2StarLounger
Posts: 126
Joined: 02 Feb 2010, 23:58

Re: Download website text only

Post by jolas »

You may be aware of this already. Chrome, the new Edge browser, and Firefox all have a Save As / Save Page As option when you right-click inside the webpage.

For Chrome and Edge there are three options; saving as HTML only leaves placeholders for the non-text objects. This is probably useful for simply structured webpages.
ChromeSave As HTML only.jpg


Interestingly, Firefox has a Text Files option in addition to Web Page, complete and Web Page, HTML only.
Firefox Save Page As.jpg

ChrisGreaves
PlutoniumLounger
Posts: 11305
Joined: 24 Jan 2010, 23:23
Location: paused.undefined.exposed

Re: Download website text only

Post by ChrisGreaves »

jolas wrote:
20 Sep 2020, 08:06
You may be aware of this already. Chrome, the new Edge browser, and Firefox all have a Save As / Save Page As option when you right-click inside the webpage.
Hi Jolas.
I'm not looking to save "a page".
I'm looking for an application that will download an entire web site.

If Chrome/Firefox/Mosaic browsers can let me point to a site with a URL, and with one click download every page on that site, then they would be a candidate application.
I do not want to save 1,000 pages, one click at a time!
Cheers
Chris

HansV
Administrator
Posts: 69049
Joined: 16 Jan 2010, 00:14
Status: Microsoft MVP
Location: Wageningen, The Netherlands

Re: Download website text only

Post by HansV »

Imagine downloading Eileen's Lounge, with tens of forums, each with many pages; more than 33000 topics, many of which have more than one page; the member list with more than 150 pages; plus all the other pages. An application would have to know the structure of the Lounge to do this in any meaningful way, and if it did, it would probably cause the Lounge to crash. So don't do it! :cranky:
Regards,
Hans

ChrisGreaves
PlutoniumLounger
Posts: 11305
Joined: 24 Jan 2010, 23:23
Location: paused.undefined.exposed

Re: Download website text only

Post by ChrisGreaves »

HansV wrote:
20 Sep 2020, 11:22
An application would have to know the structure of the Lounge to do this in any meaningful way, and if it did, it would probably cause the Lounge to crash. So don't do it! :cranky:
Oh, I won't, and it's not on EL anyway.
I used to d/l the 250+ posts from Canada NewsWire six days a week, when it was text-based (some 10-15 years ago), and then parse/extract all press releases for the Toronto/Mississauga area looking for direct lines to CEOs of large financial/pharmaceutical firms.

Prior to that I was commissioned to scour the Toronto Police blotters (again text-based) to provide an Alert service based on postal codes.
Also there was a commission to access Google Financials (and two other sites) for fiscal data on publicly traded companies for a guy who had a foolproof formula for working out which shares to buy. Far as I know he's still living in Toronto; Loser!

This new laptop is not very strong, so any attempt by me to d/l a massive web site will be doomed before I click "OK".

If I were dissecting the Toronto Police Blotter or Canada NewsWire today I'd have to cope with all sorts of crud: links to a/v files, images, etc., none of which help in determining a postal code from a few street names, once you've found the street names!

The closest I got to downloading Google was an analyser that would issue a Google Search based on a Canadian Postal Code ("L4X 2G6") and grab the one page of about one hundred hits, parse each hit, and obtain a good directory of that block of the street ("L4X 2G6"), the street ("L4X 2G"), or the area ("L4X 2").
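The block/street/area idea amounts to trimming characters off the end of the postal code. A small Python sketch of that step follows. It is not the original analyser; `find_postal_codes` and `search_levels` are names invented here, and the pattern is deliberately loose (it accepts any letter in each position).

```python
import re

# Canadian postal codes have the shape "A1A 1A1":
# letter-digit-letter, optional space, digit-letter-digit.
POSTAL_CODE = re.compile(r"\b([A-Z]\d[A-Z])\s?(\d[A-Z]\d)\b", re.IGNORECASE)


def find_postal_codes(text):
    """Return every postal code found in text, normalised to 'A1A 1A1' form."""
    return [f"{first.upper()} {second.upper()}"
            for first, second in POSTAL_CODE.findall(text)]


def search_levels(code):
    """Progressively broader search strings for one normalised postal code."""
    return {
        "block": code,        # e.g. "L4X 2G6"
        "street": code[:-1],  # e.g. "L4X 2G"
        "area": code[:-2],    # e.g. "L4X 2"
    }
```

For example, `search_levels("L4X 2G6")` gives the three strings quoted above, and `find_postal_codes` can be run over the text of each hit to pick out the codes it mentions.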

Cheers
Chris

StuartR
Administrator
Posts: 10971
Joined: 16 Jan 2010, 15:49
Location: London, Europe

Re: Download website text only

Post by StuartR »

There are lots of tools that do this; they are typically used by search engines to extract the data they want to index.

Try using your favourite web search engine to search for "Web crawler".
StuartR


ChrisGreaves
PlutoniumLounger
Posts: 11305
Joined: 24 Jan 2010, 23:23
Location: paused.undefined.exposed

Re: Download website text only

Post by ChrisGreaves »

StuartR wrote:
20 Sep 2020, 13:56
Try using your favourite web search engine to search for "Web crawler".
Thank you Stuart.
I have downloaded three to try:-

setup-cyowcopy-1.8.0-build-652
getleft-setup-v1.2-full
httrack-3.49.2
Cheers
Chris