PDA

View Full Version : Browsing the web from the command line


mala
12-26-2002, 07:50 AM
Hi!

"browsing the web from the command line" is an old idea I had and didn't have the time to develop, so I publish it here and see if anyone's interested in it...

The idea is not just to access the Web from a shell in text mode (text-mode browsers already exist ;)), but to use shell commands -or new hand-made ones- to retrieve only the information we're interested in, or to get info we couldn't see with a normal text browser. As an example, look at this oneliner from null, which prints the absolute URLs stored in a flash file:

$ lynx -source http://ret.mine.nu/top.swf | strings | grep http

As another example, have a look at http://surfraw.sourceforge.net, which lets you feed web search results to any command-line app.

Now you'll probably ask yourselves: "why should we spend time hunting for these tricks when we can already access the Web with our browsers?". Well, because this way YOU decide what to see and what to do with the data you download, you choose what to download, and you won't have any more popups and banners - and I think that alone might be enough :)

mala
12-26-2002, 07:00 PM
I've moved the thread here, since it might become a little more code-oriented... in the meanwhile, I've tried some experiments with wget and found something which might come in handy in some situations. As an example, to attract your attention, I've made my experiments on porn websites ^__^

Just take one of those "free pr0n" pages, which have new links every day that point to free sections of other websites. I've just run this single line

wget -A mpg,mpeg,avi,asf -r -H -l 2 -nd -t 1 http://url.you.like

went to the cinema, and when I came back... 160MB of movies, without having to follow links, click on images, close popups, read banners and so on.

How does it work? Here's a description of the switches I've used:

-A mpg,mpeg,avi,asf
This one makes wget save only files that end with the extensions I've specified

-r
Recursive: follows the links it finds in the homepage

-H
Span hosts: when recursive, it follows the links to foreign hosts too

-l 2 (note: lowercase 'L')
Recursion depth: I've set it to 2, to follow the links in the main page and then the links to the video files in the linked pages

-nd
No directories: doesn't create directories, puts everything in the same dir

-t 1
Retries: I've set it to 1, to avoid wasting time on retries when a server can't be reached
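
If you'd rather see the same filtering idea from the Perl side before letting wget loose on a site, here's a rough sketch using LWP::Simple and a hand-rolled href regexp - the URL and the extension list are just placeholders, and it only prints the matching links instead of downloading them:

#!/usr/bin/perl
# rough sketch: print the links on a page that end with one of the
# given extensions (the URL below is only a placeholder)
use strict;
use LWP::Simple;

my $url  = shift || 'http://url.you.like/';
my @exts = qw(mpg mpeg avi asf);
my $page = get($url) or die "Could not download $url\n";

my $ext_re = join '|', @exts;
while ($page =~ /<a[^>]+href\s*=\s*"?([^\s">]+)/gi) {
    my $link = $1;
    print "$link\n" if $link =~ /\.(?:$ext_re)$/i;
}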


Hope you found it interesting :)

null
01-04-2003, 08:58 PM
Hello...

Here is a way one could get the (nick)name of the last person who posted a reply to a favorite thread on this forum:

$ lynx -nolist -dump 'http://ret.mine.nu/board/viewforum.php?f=3' |
> grep -2 "Browsing the web" | tail -1 | awk '{ print $1 }'

- GUI is designed for USERS.

> All the others use cmdline ...

null

mala
01-05-2003, 05:23 PM
Heh, that's great! :)

In the meanwhile, I've made a little perl script to ease link parsing. That is, a script which allows you to

- extract all the links from one page
- print only the ones that either
- - follow a specified pattern in the URL or
- - follow a specified pattern in the text "tagged" by the anchor

I've tried to cut'n'paste it from the preview and it works. You can use it this way:

perl filename.pl http://url "URL regexp" "text regexp"

For instance,

perl exturl.pl http://ret.mine.nu/links.html "" "descendants"
gives you the link to the Immortal Descendants mirror (this version still doesn't convert relative URLs to absolute ones, as you can see)

Instead,
perl exturl.pl http://ret.mine.nu/links.html "cjb"
gives you only links to the websites at cjb.net

perl exturl.pl http://ret.didjitalyphrozen.com/board/search.php?search_author=mala "viewforum"
returns the URLs of all the forums of this website where I've written a message

perl exturl.pl http://ret.didjitalyphrozen.com/board/search.php?search_author=mala "" "stegano|command line"
returns the URLs of the forums or messages whose subject contains "stegano" or "command line"

-----------------------------------------------------------------------------

#!/usr/bin/perl

use LWP::UserAgent;
use Data::Dumper; # quick and dirty way to dump data on the screen

# this is a very simplified implementation of getpage but should work with
# no problems if you don't have to authenticate yourself

sub getpage{
	my $url = shift;
	my $ua = new LWP::UserAgent;

	# note: for some websites we _have_ to provide an agent name
	$ua->agent('Two/0.1');

	# connect to the main url
	my $req = new HTTP::Request GET => $url;
	my $res = $ua->request($req);

	die "Error connecting to the main page:\n".Dumper($res->headers) unless $res->is_success;

	return $res->content;
}

sub xtracturl{
	my ($content,$regexp1,$regexp2) = @_;

	my (@links,@links2);
	my %hash;

	# powerful regexp! Hope that works :)
	# $1 is the href value, $2 is the text "tagged" by the anchor
	while ($content =~ /<\s*a.*?href\s*=[\s]*"?([^\s">]+).*?>(.*?)<\/a>/gsi){
		my $url = $1;
		my $str = $2;
		if ($url =~ /$regexp1/i){
			push (@links, $url);
		}
		if ($str =~ /$regexp2/i){
			push (@links, $url);
		}
	}

	# clean links array from dupes
	for (@links){
		$hash{$_}++ || push @links2,$_;
	}

	return @links2;
}

print join ("\n", xtracturl (getpage ($ARGV[0]), $ARGV[1], $ARGV[2]))."\n";

null
01-07-2003, 10:13 PM
Looks like a very nice script mala, especially the regexp!

I have two things to point out:

1. You can safely change the <\s*a part of the regexp to <a because no whitespace is allowed between the tag opening sign "<" and the tag name.

2. You know me... I couldn't (almost) post anything here without pasting a nice example from my console:

lynx -dump http://ret.mine.nu/links.html | sed 's/^ *[0-9]*\. [^h]*//' | grep '^http'

Note that by default lynx displays all the links found inside a web page when you use the -dump switch. These links are displayed at the end of the output. If you don't want lynx to display these references, you have to use the -nolist option too.

null

mala
01-08-2003, 01:42 PM
1. You can safely change the <\s*a part of the regexp to <a because no whitespace is allowed between the tag opening sign "<" and the tag name.

Good! I wasn't sure about that, so I added those chars without even reading any html specs ;)

2. You know me... I couldn't (almost) post anything here without pasting a nice example from my console:

I know you, and I couldn't wait to see it :)

As usual it's very nice! And please, let me comment on it, to

1) See if I understood it well
2) Explain it to others so we will be able to build a little "regexp tute step by step" ;)


lynx -dump http://ret.mine.nu/links.html | sed 's/^ *[0-9]*\. [^h]*//' | grep '^http'


Well, the steps of this oneliner are the following ones:

1) Connect with lynx to http://ret.mine.nu/links.html and output the dump of the page, that is NOT the source but the rendered page, with a list of links at its end (this last detail is important ;))
2) Then the output of lynx is piped to sed, which takes it as input and processes it with a substitution of "something" (we'll see what later)
3) Finally, the output of sed is piped to grep, which returns only the lines that begin with http (that '^' before http generally means "the string begins here": in this case we are working on every line of the page processed by sed)

What does sed do?

's/^ *[0-9]*\. [^h]*//'

good, we can see an s///, which means substitution: the syntax is

s/string to substitute/new string/

and, since we have s/something//, we can understand that we actually want to DELETE something which satisfies the regexp in the first section.

^ *[0-9]*\. [^h]* means:

^ the line begins here
_* (that is, <space>*) 0 or more spaces (* means 0 or more)
[0-9]* 0 or more digits (well, maybe I'd put a "+" here, since we should always have at least one)
\._ a dot followed by a space (the dot has to be escaped with '\' because otherwise it has another meaning, that is 'any char')
[^h]* anything which is not an h, 0 or more times (^ at the beginning of the square brackets means 'not')
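
By the way, if you prefer to keep everything in Perl, the sed and grep steps can be collapsed into a single one-liner that should behave the same way, assuming the lynx output looks like the above:

lynx -dump http://ret.mine.nu/links.html | perl -ne 's/^ *[0-9]*\. [^h]*//; print if /^http/'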

Really nice! I'm going to work on the recursive exturl Perl sub (I already have a working version, but it doesn't loop; I'm working on a looping one which allows users to specify the depth - what do I mean by looping? Well, I'll explain it better in my next message ;))

null
01-11-2003, 06:04 PM
Oops!

Thanks mala for doing my job. It was actually my responsibility to explain the pipeline, but you have done it. Thanks again, it's a really good analysis - looks like you have something to do with reverse engineering, can that be!?

- Greetz -

null

score
01-11-2003, 10:45 PM
mala: i noticed in one of your posts a - .*? - in a regexp. i'm new to regexps and i understood
- . - means "any char"
- * - means 0 or more times
then, i know - ? - means 0 or 1 'times' (or if you will, 'possibly' this char...)

but i still don't understand what - .*? - does?
(still, i did use it in code - along your lines - and it helped me a lot....)

could you help?


ps. anyway i found the book "mastering regular expressions in perl" very helpful

mala
01-12-2003, 09:20 AM
mala: i noticed in one of your posts a - .*? - in a regexp. i'm new to regexps and i understood
- . - means "any char"
- * - means 0 or more times
then, i know - ? - means 0 or 1 'times' (or if you will, 'possibly' this char...)

but i still don't understand what - .*? - does?
(still, i did use it in code - along your lines - and it helped me a lot....)

could you help?

Sure!

That question mark, used after the asterisk, makes the match non-greedy (some call it the "lazy" quantifier). Usually, in a regular expression, when you write .* you mean "match everything", and this is quite a powerful command: by default the match is greedy, so it grabs everything up to the LAST occurrence of the text which follows the .*.

As an example:

<a href=".*">

will match both <a href="page.htm"> and <a href="page.htm" target="wherever">, but in the second case the .* swallows page.htm" target="wherever instead of just the file name, which is not what you want ;)

if, instead, you write

<a href=".*?">

it will match everything between double quotes, stopping at the FIRST double quote it finds.
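
If you want to see the difference with your own eyes, here's a tiny self-contained test (the tag is just a made-up example string):

#!/usr/bin/perl
# greedy vs non-greedy on the same tag
my $tag = '<a href="page.htm" target="wherever">';
print "greedy:     $1\n" if $tag =~ /<a href="(.*)">/;   # page.htm" target="wherever
print "non-greedy: $1\n" if $tag =~ /<a href="(.*?)">/;  # page.htm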

ps. anyway i found the book "mastering regular expressions in perl" very helpful

Here's a couple of links for those who would like to have a look at this book:

http://www.trig.org/cs/
http://books.pdox.net/Computers/

(quite easy to find, just put the title of the book inside google ;))

null
01-15-2003, 07:15 PM
I would like to add something about the "traditional" way of avoiding greedy matching. I'll explain it with a simple example which kills the tags in an HTML file. You could do something like:

$ cat index.htm | sed 's/<[^>]*>//g'

... and it would remove all the tags from the input file (here: index.htm) under certain circumstances (to be explained by mala !!!). The regexp here is <[^>]*>, which means:

1. match the '<'
2. match everything which is not a '>' zero or more times - [^>]*
3. match the '>'

This way the matching stops at the first occurrence of '>', in contrast to <.*>, which "eats" all the chars between the first '<' and the last '>' on a line.
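
And a quick Perl check of the same point, in case anyone wants to try it (the sample line is made up):

#!/usr/bin/perl
my $line = '<b>bold</b> and <i>italic</i>';
(my $classy = $line) =~ s/<[^>]*>//g;   # removes each tag separately
(my $greedy = $line) =~ s/<.*>//g;      # eats from the first '<' to the last '>'
print "$classy\n";   # prints: bold and italic
print "$greedy\n";   # prints an empty line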

null

mala
01-16-2003, 11:59 AM
$ cat index.htm | sed 's/<[^>]*>//g'

... and it would remove all the tags from the input file (here: index.htm) under certain circumstances (to be explained by mala !!!).

Hmm... I don't know if there's anything more (I'm not good with sed), but AT LEAST the tag has to be entirely on one line: sed processes its input line by line. With Perl you'd get around this by slurping the whole file into one string (and, if the regexp uses '.', adding the /s modifier so that it matches newlines too). Hey, what about sed?

... hope that's all ;)
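
For the record, here's what I mean on the command line: the -0777 switch makes Perl slurp the whole file as a single string, so the tag-killing substitution also works on tags split across lines (with the [^>] class you don't even need /s, since the class already matches newlines):

perl -0777 -pe 's/<[^>]*>//g' index.htm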

mala
01-16-2003, 04:50 PM
$ lynx -source http://ret.mine.nu/top.swf | strings | grep http

People, this oneliner really made me think: not only was it the start of this _beautiful_ thread, it also left me with the desire to make flash content accessible - at least as far as links are concerned - to users without flash. That oneliner has a problem: it catches only the URLs which start with http, that is only absolute ones. Reversing some flash files and applying the right regexp, I created this little - but, I hope, useful - script:

#!/usr/bin/perl

undef $/;   # slurp mode: read the whole input in one go
$_ = <>;

# SYNTAX IS: 0x00 0x83 0xlength 0x00 "string" 0x00
while (/\x00\x83.\x00(.*?)\x00/gs){
	print "$1\n";
}

That's all. Call it however you like (let's say flash.pl) and then:

$ lynx -dump http://ret.mine.nu/top.swf | perl flash.pl
or
$ lynx -source http://ret.mine.nu/top.swf | perl flash.pl

will give you:

/
members.html
ret.viewer?p=Essays
ret.viewer?p=Tools
ret.viewer?p=Challenges
ret.viewer?p=Console
ret.viewer?p=Stegano
http://ret.didjitalyphrozen.com/board/index.php
links.html
contact.html

null
01-16-2003, 05:23 PM
Hmm... I don't know if there's anything more (I'm not good with sed), but AT LEAST the tag has to be entirely on one line: sed processes its input line by line. With Perl you'd get around this by slurping the whole file into one string (and, if the regexp uses '.', adding the /s modifier so that it matches newlines too). Hey, what about sed?


Yes, this is what I meant - every tag should be on a single line. Thanks for the explanation. I don't think that sed offers any options to transform the input into a single line, but ...

... we are using the UNIX shell, so never be afraid of a new challenge! You can solve _almost everything_ from the shell. Here, tr is our friend:

$ cat index.htm | tr -d '\n' | sed 's/<[^>]*>//g'

scorer
01-16-2003, 06:18 PM
mala, in your exturl.pl script, you stated it doesn't convert relative URLs to absolute ones.
i found an easy workaround, that you may already know, for this.

use URI::URL;

sub rel2abs {
	my $rel  = shift;
	my $base = shift;
	my $uri  = URI->new_abs($rel, $base);
	return $uri->as_string;
}

for example:
$rel = /file.zip
$base = http://server.org/path/index.html
outputs => http://server.org/file.zip
and
$rel = file.zip
$base = http://server.org/path/index.html
outputs => http://server.org/path/file.zip

you may want to play with it to see how it behaves.
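
to see it in action without wiring it into exturl yet, a tiny standalone test (just the sub above plus two calls) should reproduce those two cases:

#!/usr/bin/perl
use URI::URL;

sub rel2abs {
	my ($rel, $base) = @_;
	return URI->new_abs($rel, $base)->as_string;
}

print rel2abs('/file.zip', 'http://server.org/path/index.html'), "\n";  # http://server.org/file.zip
print rel2abs('file.zip',  'http://server.org/path/index.html'), "\n";  # http://server.org/path/file.zip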

~

aside from this, i'm currently working out a searching script based on ideas found in essays on searchlores (lexi_wot, lexi_lau).

i have some (simple) ideas about how to improve on this, and would enjoy sharing these, and some code.
let me know if you're on the same track.
(the goal here is to build something able to parse results from *any* SE/database/searching facility...)

mala
02-14-2003, 11:21 AM
Hi! :)

I'm sorry I've been away for so long... I had some projects going on which took up quite a lot of time. But now I've written a nice crypto paper for beginners (unfortunately, it's encrypted itself... with an algorithm called Italian: anyone who wants to read it anyway, and maybe translate it, is welcome :wink:) and made some experiments with genetic algorithms (I've made my first genetic keygen! :wink:)

mala, in your exturl.pl script, you stated it doesn't convert relative URLs to absolute ones.
i found an easy workaround, that you may already know, for this.

Thank you very much; in the meantime I had already updated that script, because I needed one which gave me absolute URLs. The state of the work now is that I can connect to any phpBB messageboard and dump all its messages to my hard disk... it's awful (one big HTML file at the moment) but I'm working on the MySQL interface these days :)

aside from this, i'm currently working out a searching script based on ideas found in essays on searchlores (lexi_wot, lexi_lau).

i have some (simple) ideas about how to improve on this, and would enjoy to share these, and some codes.
let me know if you're on the same track.
(the goal here is to build something able to parse results from *any* SE/database/searching facility...)

It's a very nice topic, and it's quite similar to the one I'd like to carry on: accessing the Web in an intelligent way, keeping the interesting content and stripping away the uninteresting parts. So please, if you like and have the time, share your ideas and I'm sure many will participate (but I might be wrong ;)).

mala
03-29-2003, 09:38 AM
Hi ppl! :)

I know much time has passed, but I worked a little on that flash regexp this afternoon and thought this might be useful to someone: given a URL, the script downloads a flash file and extracts its URLs; then it looks up the <title> of each linked page and creates some HTML code which contains a "usable" menu.
I think my next step will be to join this one to a little proxy (I have some ready source code for this) and make an automatic converter which makes flash websites accessible to browsers which are not flash capable... will let you know ;)

(hey, thank you very much again, scorer, for your url conversion sub: as you can see, I've used it here! :))

#!/usr/bin/perl

use LWP::Simple;
use URI::URL;

sub rel2abs {
	my ($rel,$base) = @_;
	my $uri = URI->new_abs($rel, $base);
	return $uri->as_string;
}

$url   = $ARGV[0];
$flash = get($url) || die "Could not download $ARGV[0]";

# SYNTAX IS: 0x00 0x83 0xlength 0x00 "string" 0x00
while ($flash =~ /\x00\x83.\x00(.*?)\x00/gs){
	my $nextitle;
	my $link = rel2abs ($1,$url);
	my $nextpage = get ($link);
	if ($nextpage =~ /<title>(.*?)<\/title>/i){
		$nextitle = $1;
	}else{
		$nextitle = $link;
	}
	print qq|<a href="$link">$nextitle</a><br>\n|;
}
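
If you save it as, say, flashmenu.pl (the name is just an example), something like

perl flashmenu.pl http://ret.mine.nu/top.swf > menu.html

should give you a plain HTML menu you can open in any browser, flash or not.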

mala
03-29-2003, 11:24 AM
... I've just uploaded the first release of TWO on sourceforge:

http://two.sf.net

I hope this will help, if not in browsing forums offline effectively, at least in creating new power browsing tools. Please, let me know what you think of it and whether you find it useful for your purposes. Any feedback is appreciated (insults only by private mail ;))