Power Browsing by Malattia

RELATED TITLES
10 views 1 0
Power Browsing by Malattia

Uploaded by Sanjin Vukojević 
Full description
   
Save Embed Share Print Steve Jobs Cryptonomicon The Woman Who
Smashed Codes: A True
,-"^"-,._.,-"^"-,._.,-"^"-,._.,-"^"-,.
~WhatYouSeeIsWhatYouChose~WhatYouSeeIsWhatYouChose~WhatYouSeeIsWh
Chose~
POWERBROWSING
v1.2.1 (en)
+mala, 20050318
malattia(at)gmx(dot)net
~WhatYouSeeIsWhatYouChose~WhatYouSeeIsWhatYouChose~WhatYouSeeIsWh
Chose~
,-"^"-,._.,-"^"-,._.,-"^"-,._.,-"^"-,.
1. INTRO
1.1 How are you browsing now?
1.2 How does it work, instead?
1.3 Notes
2. TECHNOLOGIES
2.1 Why HTTP?
2.2 Why HTML?
3. PB TECHNIQUES: tools and basics
3.1 Alternative browsers
3.2 Leechers and Teleporters
3.3 Spiders and scrapers
3.4 Proxy-like software
4. PB TECHNIQUES: advanced
4.1 Learn to search
4.2 Experiments with curl
4.3 Wget and lynx oneliners
4.4 Fight Against Flash
5. BOT BASICS
5.1 Detecting Web Patterns
5.2 Website navigation with bots
5.3 Data Extraction
6. PERL POWERBROWSING TOOLS
6.1 Why Perl?
6.2 Perl Packages
6.3 LWP::Simple
6.4 LWP::UserAgent
7. SHARE YOUR DATA
7.1 Web and RSS
RELATED TITLES
10 views 1 0

Full description
   
-----------------------------------------------------------------------------
This paper is dedicated to Scott R.Lemmon, Proxomitron's author. What you've
started will never stop, and your ideas will be forever alive, around the Net
and inside our minds.
-----------------------------------------------------------------------------
A big THANK YOU to Andreageddon, who helped me with the English translation
when I didn't have enough time to do it :)
==============================
============================================== ==================================
==================
=========
1. INTRO
-----------------------------------------------------------------------------
What is PowerBrowsing? The text you can read around the title is a good
explaination of this word: PowerBrowsing means browsing the Web seeing only
what you chose to see. Even if this might seem an easy thing to do, it is not
and it will become harder and harder in the future... unless, of course, you
become "PowerBrowsers".
This text tries to explain how the Web works now and how you can PowerBrowse,
using ready made tools or creating new ones which fit your needs.
1.1 How are you browsing now?

-----------------------------------------------------------------------------
How do most of you browse now? Well, you probably use the browser which is
installed by default in your system which, most of the times, means Internet
Explorer. The chances to customize the way you see Web pages is limited by
the options your browser allows you to change, so you probably download all
the images inside a website and all the active contents like Flash and Java
applets, you see more and more advertisements inside Web pages and you have
to close tons of popups when you visit some websites.
I'm sure that some of you are already getting angry for this generalization,
because they use another browser... hey, maybe another operating system too!
Well, let me try to guess anyway: despite of this, you usually follow the
links your browser (whatever it is) shows you, you see pages inside windows
RELATED TITLES
10 views 1 0

Full description
   
need: maybe you don't see it, but be sure your modem notices it!
Well, anyway, why bother? Maybe it's not exactly what you wanted, but modems
are becoming faster and faster, and after all this is what you're given and
you can't change it much.
1.2 How does it work, instead?

-----------------------------------------------------------------------------
Hey, wake up! I've got a piece of news for you: a computer is not a TV! It's
supposed to do what _you_ ask, not what _others_ want it to do. Also, things
are not always as they look: before the end of this text, you will realize
that most of the times you're able to see what a browser doesn't directly
show you, and to hide what it shows you by default.
Now, think about downloading one tenth of the data you usually download,
skipping all the advertisements, avoiding popups, keeping interesting data
(and only them) on your computer and accessing them while you're offline, in
a custom, easier and more effective way. Think that, once those information
are on your hard disk, you can write programs which work on them to produce
new, even more interesting information. Finally, think about the chance of
making all these data available to everyone, maybe in an automatical way.
All of this is
i s what I call PowerBrowsing.
1.3 Notes
-----------------------------------------------------------------------------
Well, before you read this text I think I need to write some notes about it.
First of all, this paper was born as a lecture i kept at Genova (Italy) in
April, 2004. The
T he slides I prepared were then adapted to become a full text,
but of course still have some characteristics of the original speech you
might not expect.
First of all, the speech was intended for a mixed audience, so it is this
text: this means that, while some of you might find some parts of this tute
interesting and others too hard, some others might become really bored before
they find something they're interested in, and other ones might find nothing
which deserves to be remembered here. My suggestion to everybody is to skip
RELATED TITLES
10 views 1 0

Full description
   
Hacks", by Tara Calishain and Kevin Hemenway (aka Morbus Iff), published by
O'Reilly, which gave me many interesting ideas and which everybody who wants
to create Web bots should read (and, why not, maybe BUY too, if you think
like me that the author deserves it). Between the many interesting info you
can find inside this book, there are some nice notes about bot netiquette
everyone should read (and follow).
==============================
============================================== ==================================
==================
=========
2. TECHNOLOGIES
-----------------------------------------------------------------------------
To understand what happens inside your PC when you download a Web page, you
should know the technologies the Web is based on. Inside this text I'll give
HTTP and HTML basics for granted, anyway I'll try to explain things while I'm
writing about them. Since, of course, most of what I'll write will be hard to
understand anyway, here are some links to websites you might find useful:
http://www.w3.org/MarkUp/ (everything you need to know about HTML)

http://www.w3.org/Protocols/ (everything you need to know about HTTP)
2.1 Why HTTP?

-----------------------------------------------------------------------------
To tell you the truth, you won't need to know all the details regarding this
protocol. Anyway, in the next sections of this text you will read something
about HTTP-related concepts, and knowing their meaning in advance will help
you much and let you understand what your Perl bots will be able to do.
Moreover, knowing what kind of data your computer will exchange with the
servers it connects to will help you to create more stable and more secure
bots.
GET and POST

------------
GET and POST "methods" are the first two terms we will learn: they represent
the two different ways your browser uses to ask servers for data. Without
delving too deeply into the details, the main features of the two methods
are the following ones:
RELATED TITLES
10 views 1 0

Full description
   
http://www.web.site/page.php?parameter1=value1&parameter2=value2
(you probably have read this kind of urls before, inside the address bar of
your browser). The amount of data you can send with a GET is quite limited,
also you should keep in mind that the params you send with a GET are
usually saved inside Web server logs (too).
- POST is used only when you want to send data to the Web server. The amount
of byte you can send is higher than with GET and inside server's logs you
will be able to find only the URL you POST to, and not the data you have
sent. This is quite important if, for example, you want to create some bots
which automatically authenticate and log inside a website: in this case
POST is better than GET, but keep in mind that, if the connection is in
clear, your login and pass won't be safe anyway (as they aren't now with
your browser).
Referer
-------
Among the many parameters that are usually sent together inside a GET or POST
request, whenever you reach a page following a link browsers usually send
your referer to the server, that is the URL of the website you're coming
from. This piece of information can be used by the server itself to check if
you're coming from one of its pages (and not, for example, from your hard
disk: countless hacks can be made just editing a web page locally!), or to
create statistics about how many users come from some particular websites.
User-Agent
----------
The User Agent is the software which connects to a server and communicates
with it, asking the pages you've choosen and receiving them. In practice,
anyway, most of the times this name is used for something else: it's the name
of the string used by the app (which could be a browser or any other piece of
software) to identify itself with the server.
Sometimes, this very string is used by the server to allow some programs to
access a page and to keep away others: this happens, for instance, with some
"optimized for Internet Explorer" websites. The most intelligent browsers
(if you ask, NO, Internet Explorer is not one of them) allow you to send
different User-Agent strings (custom or standard, ready made ones), and if
you plan to write some serious piece of Web software you should add this
feature too.
RELATED TITLES
10 views 1 0

Full description
   
this reason, cookies are often disliked by the ones who want to defend their
privacy. However, cookies are used almost everywhere now and some websites
don't even work if the application which connects there doesn't support them.
Proxy
-----
Proxies are programs which forward your Web apps requests to the desired
servers and return their answers to the clients. Their job can be easily
described with this:
+---------+ +---------+ +---------+

| | request | | request | |
| |--------------->| |--------------->| |
| CLIENT | | PROXY | | SERVER |
| |<---------------| |<---------------| |
| | answer | | answer | |
+---------+ +---------+ +---------+
Proxies are very useful for many different reasons:
- the client might not be able to access servers, but it might be authorized
to access the proxy: in this case, client apps could reach the server
anyway, passing their requests through the proxy
- some proxies have a cache, inside which the most frequently requested files
are saved. So, if the connection between the client and the proxy is much
faster than the one between client and server, then you might be able to
download cached files much faster
- some proxies don't tell the server where requests come from, so the clients
are able to connect anonymously to the Net
- later, we'll see how you can use proxies to get many more, and even more
interesting, advantages.
2.2 Why HTML?

-----------------------------------------------------------------------------
RELATED TITLES
10 views 1 0

Full description
   
Since inside dynamic websites forms are always present, you should spend some
time to understand how this technology works. To tell you the truth, it's not
such a hard work, anyway some experience in this field will help you not only
to easily understand the syntax and the meaning of what you see, but also to
understand how a whole website works. And without any particular intrusive
technique, but only with forms and html knowledge (and well, yes, some brain
too), you'll be able to create bots which can do much more than you can
imagine now.
================================================================
=========
3. PB TECHNIQUES: tools and basics
-----------------------------------------------------------------------------
Around the Net you can find lots of ready made PowerBrowsing tools for free
(or, if you prefer, sold for a lot of money), for any operating system.
Describing all of them in detail is impossible, so we'll try to group them
in categories and describe their main characteristics.
3.1 Alternative browsers

-----------------------------------------------------------------------------
Alternative browsers are, basically, all the ones which are not Internet
Explorer. Even if they're not the definitive solution to any problem, their
choice is the first step you can do to free your system from unrequested
contents, such as advertising banners, flash menus and popup windows.
Opera, for instance, allows you to toggle image loading in a Web page (or to
disable just non-cached images, such as banners) with a simple clic, while
the same operation with IE requires you to browse for a while inside all the
configuration menus. Opera also allows you to quickly toggle Javascript or
other dynamic contents and, in the same way, you can choose a custom view of
Web pages, with fonts and colors chosen by you instead the ones decided by
the page creators.
Between all the browsers with a GUI, I've heard very good comments about
Firebird. I haven't tried it enough to write about it here, so I'm waiting
for many feedbacks and "PowerBrowsing hints" about it. Anyway, without going
RELATED TITLES
10 views 1 0

Full description
   
they usually download the whole content of the pages they find;
- "scrapers" are programs which extract only some specific parts out of Web
pages. In fact they often have to download whole pages anyway, however they
can save on your disk only the data you're really interested in and not all
the pages' contents.
You won't probably find many ready-to-use spiders and scrapers around the Web
so easily, however there are some interesting ones: for instance, liberopop
(with all its variants) is a program which allows you to download with your
email client the messages you receive in your mailbox, which could otherwise
be browsed only on the Web. Of course, the problem of providers which first
offer free mail and then close their pop3 servers is not new, and different
solutions have been found, during the years, to fight this trend: just give a
look at "Perl@usa.net" paper by Blue (which you can find on any fravia's
mirror, in the bot -botstart.htm- section), dated 1999! Anyway, I liked
liberopop's structure
+---------+ +----------+ +---------+

| | POP3 req. | | HTTP req. | |
| |------------>| POP3 |------------>| |
| CLIENT | | EMULATOR | | SERVER |
| |<------------| |<------------| WEB |
| | POP3 ans. | | HTTP ans. | |
+---------+ +----------+ +---------+
which not only reminds a proxy very much, but is also almost identical to the
one I designed for an old app of mine (ANO - Another Non-working Offline
forum reader), which allowed you to read web forums with your email client.
If you're interested in this kind of apps, you might like one of my latest
projects, called TWO (The Working Offline forum reader), a scraper I'll
describe with a little more detail in section 7.4.
3.4 Proxy-like software

-----------------------------------------------------------------------------
The programs which fall into this category use the same architecture of proxy
RELATED TITLES
10 views 1 0

Full description
   
pages on the fly; in the same way, they can read the data you send to the
server and use them later, to automatically authenticate, log and browse into
a website.
An example of the first kind of proxy (that is, the one which filters data
coming from the server) is Proxomitron, a small but great Windows application
which allows you to filter Web pages, cutting away everything you don't like
(banners, popups, javascript, spyware) and completely changing their look.
Trying to create a platform-independent version of this program, a group of
reversers is working on a project, called Philtron, which aims to create a
PHP, Proxomitron-compatible application with even more functions. You can
find more about this project at these URLs:
http://philtron.sf.net (main project page)

http://fravia.2113.ch/phplab/mbs.php3 (PHPLabs)
http://fravia.2113.ch/phplab/mbs.php3/mb001 (Seeker's messageboard)
For what concerns the second proxy type (the one which works on data you send
to the server), a good example is the Web Scraping Proxy: this Perl app can
"record" everything you do inside a website, then it automatically creates
the source code needed to make a perl bot which will mimic all your actions.
To know something more about this program, check the website
http://www.research.att.com/~hpk/wsp
or give a look at the "hack" (number 30) inside Spidering Hacks.
Note: I've recently found another perl package which should do the same
thing. It's called HTTP::Recorder and you can easily find it at CPAN
(try the search engine at http://search.cpan.org).
================================================================
=========
4. PB TECHNIQUES: advanced
-----------------------------------------------------------------------------
More advanced PowerBrowsing techniques don't just use a single program, no

matter how advanced its options are: they usually take advantage of both the
knowledge acquired by users and the functions provided by one or more tools,
which can be ready made ones or created ad hoc. This way you get new, more
RELATED TITLES
10 views 1 0

Full description
   
experience, the better results will be. Anyway, even with the few, simple
suggestions which follow you'll be able to find and download much more easily
everything you like.
Speaking about examples, keep in mind that they are not supposed to be very
useful in practice. Moreover, given the speed with which Web information
change, some of them even might not work anymore when you read this document.
Don't worry too much about this: read them, understand them, try to change
them and be sure that you'll have something more than some lines of code.
4.1 Learn to search

-----------------------------------------------------------------------------
As you probably have understood yet, if you want to see only what you're
interested in you first have to _find_ what you're interested in. On the
other side, in some cases you might already know where that file you wanted
to download resides, but for some reason you don't want to connect to the
website it's stored in (for instance, because you have to pay to do that):
even in this case, knowing how to search will help you to find alternative
sites from which you'll be able to get the same file.
A first suggestion I'd give to anyone is to visit the good Searchlores site
(http://searchlores.org). Inside it you will be able to find many tutorials
about Web research and technologies, inside a place which is completely free
from advertisements, banners and commercial stuff. In particular, I suggest
you to give a look at "search webbits", ad hoc search strings for some file
or information categories.
Between these, the "index of" trick is one of the best ones, even if some
commercial websites are already trying to use it to attract searchers to
their pages. In practice, you can use it to restrict the Web search to those
directories which are open on the Web: these ones are nothing more than long
file lists, and always have at their beginning the "Index of <dir name>"
header and a link to their "parent directory".
Now, if for instance you want to download ringtones without paying a cent,
why do you have to get lost between dialer and commercial websites when you
can download all the songs you like in midi format? To find the websites
which share them freely, you just need to feed google with the following
string:
RELATED TITLES
10 views 1 0

Full description
   
"index of" "parent directory" fun .mpg
(change .mpg with your favorite video format) you can find all the videos you
like and, if you're lucky, even some websites which are regularly updated
with new funny resources.
If, instead, you want to try the same trick with mp3s, you'll probably find
lots of wrong results, which link you to commercial websites. This happens
because, as I told you before, once you start using frequently a trick "on
this side", this becomes used "on the other side" (the choice of which side
is the dark one is left as an exercise to the reader) to attract people where
they don't want to go. Fortunately, regardless of how many techniques they'll
try to use to trick us, we will always be one step before them ;)
Quite banally, if you want to take away most of the fake results from your
search, you can try to cut away the classic web pages extensions. If you see
you still get too many results, you can add some filters on specific terms:
"index of" "parent directory" .mp3 Iron Maiden -.html -.htm -faq
Here, for instance, we're searching some Iron Maiden mp3s, cutting away HTML
pages (which are not interesting for this research) and the results which
contain the "FAQ" word, because in some newsgroup's FAQ somebody talked about
both Iron Maiden and mp3s and someone else had the great idea of mirroring
them almost anywhere.
If, finally, you even know the song's title, you can try to insert its last
word, joined to the file extension, or add a part of the title to the search
strings:
"index of" "parent directory" Metallica frantic.mp3 -.html -.htm
Another suggestion I can give you about web searching is to use webtrackers.
Many Web sites use them to have access stats to study: what many don't know,
anyway, is that many commercial trackers are open to anyone and let you see,
between many different stats, referrers too. For instance, try to give a look
at my webtracker:
http://extremetracking.com/open?login=alam
RELATED TITLES
10 views 1 0

Full description
   
interested in and the name of a webtracker (or its URL, or a string which
uniquely identifies it) to start delving deep inside a mine full of
potentially interesting links.
A last suggestion, if you're searching a particular file and especially if

it's a copyrighted one, is to give a look to peer to peer channels first: the
download will probably be slower than the one from a Web site, but you'll
have much more chances to find a movie or a book, even if you have only a
couple of words from its title... And, once you have the filename, you might
even decide to get back to normal Web search.
If, while searching for a particular file on a p2p network, you happen to
find the names of groups which periodically release files of the same kind
(for instance, horror movies or comics or tv series or whatever else), write
them down: next time you'll have more chances to find what you're searching
for, using these names between your search strings. In the same way,
following (while paying attention, of course) the links you can find inside
the classic .nfo files, I happened to find some monothematic communities,
much more specialized and full of contents than any search engine I used.
4.2 Experiments with curl

-----------------------------------------------------------------------------
Inside some websites you might happen to find long lists of files of the same
type, whose filename has always the same prefix followed by an incremental
number. If the file list is visible (that is, browseable) from the Web, with
an HTML page, or because the directory where the files are stored is
accessible, the easiest method to download all the files is always wget:
wget -m -np http://web.site.url/directory/
If you want, you can specify the extension of the files you want to download:
wget -m -np -A <extension> http://web.site.url/directory/
Unfortunately, direct access to directories is often closed and many sites

don't give you a file index, but force users to download at least as many
pages (with unuseful images, banners and popups) as the files you want to
download. For this task, the most useful utility you can find is curl.
RELATED TITLES
10 views 1 0

Full description
   
http://web.site.url/directory/file[1-100].txt
http://web.site.url/directory/file[001-100].txt
http://web.site.url/directory/file[a-z].txt
http://web.site.url/directory/file[1-4]part{One,Two,Three}.txt
allow you to download all the files which begin the same way and continue
with, respectively,
- numbers from 1 to 100

- numbers from 1 to 100 (padded with zeroes to always have 3-digit numbers)
- letters from "a" to "z"
- numbers from 1 to 4, followed by "partOne", "partTwo", "partThree"
Curl has many more options than the ones described here. It also supports
various protocols (HTTP, HTTPS, FTP, GOPHER, DICT, TELNET, LDAP, FILE) and
can be used for file upload too (to know more about this, write "man curl"
inside your shell). However, it still has some limitations: for instance, it
still cannot efficiently manage filenames which contain dates inside them.
If you run this command
curl -LO http://web.site.url/dailystrips/[2000-2004][01-12][01-31].gif
you will download all the images you're interested in, but you'll send the
server more requests than you need, trying for example to download images
dated February, 30th or June 31st...
There are many techniques to solve this problem: between them, there are some
you will see inside the next section and that will allow you to automatically
download, every day, your favorite daily strips.
4.3 Wget and lynx oneliners

-----------------------------------------------------------------------------
Inside this section you will have the chance to see some oneliners which make
use of wget and lynx. They are the result of an old project I named "Browsing
the Web from the command line", which had some contributors inside RET forum
(http://www.reteam.org). The project has been inactive for a long time, but
since it's part of PowerBrowsing now you are free to contribute, sending
comments, requests or new experiments. Thanks in advance :)
RELATED TITLES
10 views 1 0

Full description
   
corrected are, probably, still many. The source code is provided "as is" and
you'll probably need some time to understand how it works and how it can be
improved. However, if you are interested, you can download TWO's source
code and documentation from http://two.sf.net. Let me know if you can make
something good out of it ;)
================================================================
=========
8. Examples
-----------------------------------------------------------------------------
In this section you'll have the chance to see and try some examples. To save
space, I've decided not to publish their source code here, but to insert a
link from which you can download them. If you can't connect there, you can
send me a mail and I'll answer you with an alternative URL.
- Common lib
http://3564020356.org/cgi-bin/perlcode.pl?file=common.pm
common.pm is a package I created when I was developing TWO. It contains all

the main functions I used to control a web bot behavior:
getpage is very similar to LWP::Simple "get" command, but it uses

LWP::UserAgent instead to obtain some more advantages: it can manage
cookies, a proxy, UserAgent identification and multiple retries before
it aborts a download.
exturl collects, inside an array, all the links it can find inside a Web
page and which satisfy one or more regular expressions inside the URL or
the tagged text: this allows you to follow links such as "all the files
whose name end in .txt" or "all the links whose text matches 'Next'".
walkpages is a recursive function which uses exturl to follow a list of

links and collect a list of others. It can work in different ways: it can
follow different links depending on the depth and it can collect only the
ones it finds in the last page, or all the ones it finds during its
travel.
walkpages_loop is the "looped" version of walkpages: that is, it follows

the same (or the same list of) links undefinitely (or for a specified
RELATED TITLES
10 views 1 0

Full description
   
NOTE: it uses common.pm
Cinemaz is a program I created for personal use, to automatically gather

data from http://www.monzacinema.it. This website shows all the movies you
can find this week in different cinemas, but if you want to choose a
particular movie, know where and when you have to go to see it, and find
a phone number to book a seat, well you have to click far too many times!
So, this bot connects to the main page and downloads all the pages which
describe the cinemas, extracting their name, the phone number, the movie
name and the timetable. All the information are shown in a good old plain
text file, which has everything you need... and only it.
With some little adaptations I managed to use the same script with procmail
and now, wherever I am, I just need to send my bot a mail with "cinemaz" as
subject to have an answer containing the very same text file :)
- Things I've learned from B-Movies

http://3564020356.org/cgi-bin/perlcode.pl?file=badmovies.pl
NOTE: it uses common.pm
http://www.badmovies.com is a very funny website, containing many B-movies

reviews. One of the funniest things, IMO, is that inside every movie page
there's a section, called "Things I've learned from B-Movies", with a list
of fortunes about (silly) things you can learn from that movie.
When you run the script it downloads the movie list with their links, it
chooses and follows a random one, then it extract all the quotes from the
"Things I've learned from B-Movies" section and finally, depending on how
you ran it, it shows all of them or just a random one, like the "fortune"
application.
- Malacomix
http://3564020356.org/cgi-bin/perlcode.pl?file=comics.pl
Malacomix connects to http://www.comics.com and lets you see, every single

day, your favorite comic strips. Since it uses the common syntax used by
the website to find all the different strips, you just have to write inside
the URL the names of the strips you want to see and it will generate an
RELATED TITLES
10 views 1 0

Full description
   
This script has been created with a particular purpose: to let everyone see
Happy Tree Friends episodes _the way they like_ (and not in a fixed-size
popup window) or download them on their hard disk. The script is very, very
easy, but it followed a more advanced "flash reversing" work. Maybe one
day, when you are not so tired because you've read this loooong text, I'll
explain you this one too ;)

Power Browsing by Malattia

Uploaded by

Copyright:

Available Formats

Power Browsing by Malattia

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Power Browsing by Malattia

Uploaded by

Copyright:

Available Formats

RELATED TITLES