
CGIProxy 1.4.1  (released March 8, 2001)

HTTP/FTP Proxy in a CGI Script

(c) 1996, 1998-2001 by James Marshall, james@jmarshall.com
For the latest, see http://www.jmarshall.com/tools/cgiproxy/

------------------------------------------------------------------------
This CGI script acts as an HTTP or FTP proxy.  Through it, you can
retrieve any resource that is accessible from the server this runs on.
This is useful when your own access is limited, but you can reach a server
that can in turn reach others that you can't.  By default, no user info
(except browser type) is sent to the target server, so you can set up your
own anonymous proxy like The Anonymizer (http://www.anonymizer.com/).

Whenever an HTML resource is retrieved, it's modified so that all URLs
in it point back through the same proxy, including form submissions.
Once you're using the proxy, you can (almost) forget it's there.
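
As a rough illustration of that rewriting, here is a minimal Perl sketch.
It is not the code CGIProxy uses: the proxy address, the encode_target()
and proxied_attr() helpers, and the "scheme/host/path" encoding are all
made up for the example, and it only handles absolute http:// URLs in
double-quoted href, src, and action attributes.

```perl
use strict;
use warnings;

# Hypothetical proxy location; not taken from the CGIProxy source.
my $PROXY_URL = 'http://proxy.example.com/cgi-bin/nph-proxy.cgi';

# Hypothetical encoding: "http://host/path" becomes "<proxy>/http/host/path".
sub encode_target {
    my ($url) = @_;
    $url =~ s{^(\w+)://}{$1/};
    return "$PROXY_URL/$url";
}

# Rebuild one attribute so that its URL points back through the proxy.
sub proxied_attr {
    my ($attr, $url) = @_;
    return $attr . '="' . encode_target($url) . '"';
}

# Rewrite absolute http:// URLs in href, src, and action attributes.
# Relative URLs and unquoted attributes are left alone in this sketch.
sub rewrite_html {
    my ($html) = @_;
    $html =~ s{\b(href|src|action)\s*=\s*"(http://[^"]*)"}{proxied_attr($1, $2)}gie;
    return $html;
}

print rewrite_html('<a href="http://www.example.com/a.html">A</a>'), "\n";
```

A real implementation also has to resolve relative URLs against the
page's base URL and cover many more tags and attributes.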

Configurable options include cookie support, text-only proxying (to
save bandwidth), simple ad filtering, script removal, custom encoding
of target URLs, and more.

Requires Perl 5 as shipped, but can use Perl 4 with a simple config change.

The original seed for this was a program I wrote for Rich Morin's
article in the June 1996 issue of Unix Review, online at
http://www.cfcl.com/tin/P/199606.shtml.

IMPORTANT NOTE ABOUT ANONYMOUS BROWSING:
CGIProxy was originally made for indirect browsing more than anonymity,
but since people are using it for anonymity, I've tried to make it as
anonymous as possible.  Suggestions welcome.  For best anonymity, browse
with JavaScript turned off.

Anonymity is pretty good, but may not be bulletproof.  For example, if
even a single JavaScript statement can be run, your anonymity can be
compromised.  I've tried to remove JS from every place it can exist, but
please tell me if I missed any.  Also, browser plugins or other executable
extensions may be able to reveal you to a server.

------------------------------------------------------------------------
LEGAL DISCLAIMER:

Censorship is a controversial subject, and some governments and companies
have rules about what information you should have access to.  If you use
my software to bypass rules that have been imposed on you, you assume all
legal risks and responsibilities involved.  I'm providing the software as
a demonstration and teaching tool, and for when legitimate access is
needed to non-accessible servers.  I won't encourage you to break any
rules, because I would get in trouble if I did.  I can't prevent you from
using this software in illegitimate ways, but I believe the value of it as
a teaching tool is far too great to let a few miscreants ruin it for
everybody.

------------------------------------------------------------------------
TO INSTALL:

To run this, your server must support Non-Parsed Header (NPH) CGI scripts.
Most servers do, but not all.  (Starting in version 1.3.2, there may be a
way to run this script without NPH support; see the notes about
$NOT_RUNNING_AS_NPH below and in the source code.)

Quick answer:  Unpack the script and call it.

Longer answer:

  1) Unpack the distribution.
  2) Install the script like any other CGI script (set permissions and
     path to the Perl interpreter).  Be sure it's installed as an
     NPH script.  In Apache and related servers, do this by starting the
     filename with "nph-".
  3) Set any of these that are required for your installation (none for
     most people):
       . If this is running on an SSL server, set $RUNNING_ON_SSL_SERVER=1.
       . If this is running on a Windows server, set $RUNNING_ON_WINDOWS=1.
       . If this proxy uses another HTTP proxy (like a firewall), set
           $ENV{'http_proxy'} and $ENV{'no_proxy'}.  If that proxy uses
           authentication, set $PROXY_AUTH.
       . To use Perl 4 instead of Perl 5, see the instructions by the 
           "use Socket" line.
  4) Set any desired options in the script:
       . To restrict forwarded data to text only, set $TEXT_ONLY=1.
       . To remove all cookies, set $REMOVE_COOKIES=1.  Alternately, you
           can allow and ban cookies from specific servers, with the lists
           @ALLOWED_COOKIE_SERVERS and @BANNED_COOKIE_SERVERS.
       . To remove all script content, set $REMOVE_SCRIPTS=1 (on by
           default).  Alternately, you can allow and ban scripts from
           specific servers, with the lists @ALLOWED_SCRIPT_SERVERS and
           @BANNED_SCRIPT_SERVERS.  Note that this removes most popup ads,
           and helps greatly with anonymity.
       . To filter ads and ad-related cookies, set $FILTER_ADS=1.
           You can customize this behavior with a few related settings.
       . To display a URL entry form on each page, set $INSERT_ENTRY_FORM=1.
       . To let users choose their own $REMOVE_COOKIES, $REMOVE_SCRIPTS,
           $FILTER_ADS, and $INSERT_ENTRY_FORM, set $ALLOW_USER_CONFIG=1.
       . To customize the encoding format for target URLs, modify the 
           &proxy_encode() and &proxy_decode() routines.
       . If you want to restrict access to only certain destination
           servers, set @ALLOWED_SERVERS and @BANNED_SERVERS.
       . To insert your own block of HTML into each page, set either
           $INSERT_HTML or $INSERT_FILE.
       . If you absolutely, positively, do not have NPH support on your
           server, you may be able to run this script as a normal non-NPH
           script:  Set $NOT_RUNNING_AS_NPH=1, and read the comments above 
           it in the source code for possible dangers.
       . For crude load-balancing among a set of proxies, set @PROXY_GROUP.
       . Other minor config is possible; see the user configuration section.
       . If heavy use of this proxy puts a load on your server, see
           "NOTES ON PERFORMANCE" in the source code.

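For orientation, a few of the options above might be set like this in the
user configuration section of the script.  This is only a sketch: the
values are illustrative, the host name is hypothetical, and the exact form
of each assignment in the shipped script may differ, so check the comments
next to each variable in the source.

```perl
# Illustrative values only, not recommendations.
$TEXT_ONLY         = 0;    # set to 1 to forward text only
$REMOVE_COOKIES    = 0;    # set to 1 to remove all cookies
$REMOVE_SCRIPTS    = 1;    # on by default
$FILTER_ADS        = 1;    # filter ads and ad-related cookies
$INSERT_ENTRY_FORM = 1;    # show a URL entry form on each page
$ALLOW_USER_CONFIG = 0;    # set to 1 to let users choose the four flags above

# The cookie server lists hold Perl patterns (since version 1.2).
# The host below is a made-up example.
@BANNED_COOKIE_SERVERS = ('\.ads\.example\.com$');
```
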
------------------------------------------------------------------------
TO USE:

Call the script directly to start a browsing session.  Once you've gotten
a page through the proxy, everything it links to will automatically go
through the proxy.  You can bookmark pages you browse to, and your
bookmarks will go through the proxy as they did before.

------------------------------------------------------------------------
YOU CAN HELP IMPROVE THIS PROXY BY TELLING ME:

1) Any HTML tags with URLs not being converted, including non-standard ones.

2) Any method of introducing JavaScript or other script content that's not
being filtered out.

3) Any script MIME types not being filtered out.

4) Any HTML-like MIME types other than text/html that contain links with
URLs that need to be converted.


Please verify you're using the latest version of CGIProxy before emailing me.

------------------------------------------------------------------------
LIMITS 'N' BUGS:

Anonymity is NOT PERFECT!!  In particular, there may be some holes where
JavaScript can slip through.  For best anonymity, turn JavaScript off.

URLs generated by JavaScript or similar mechanisms won't be re-proxy'ed
correctly.  JavaScript in general may not work as expected.

If you browse to many sites with cookies, CGIProxy may drop some, but I
haven't seen this happen yet.

To save CPU time, I took some shortcuts with URL-handling.  I doubt these
will ever affect anything, but tell me if you have problems. (The shortcuts
are listed in the source code.)

I didn't check the spec on HTTP proxies when I first wrote this (sometime
in 1996).  It's possible the protocol is violated.  Actually, this whole
concept is a violation of the proxy model, so I'm not too worried.  If any
protocol violations cause you problems, please let me know.

Only HTTP and FTP are supported so far.

========================================================================

CH'CH'CH'CH'CHANGES:
--------------------


1.4.1 and 1.4.1-SSL, released March 8, 2001:
--------------------------------------------

CPU load was decreased 15% with two simple changes that I should have thought
of long ago.

Fixed error with <meta> "refresh" tags that caused proxy to loop through
itself.

Fixed problem with user-chosen URL entry form.



1.4-SSL, released February 22, 2001:
------------------------------------

This is a special version that can retrieve pages from SSL servers.  It
is based on version 1.4, and otherwise works almost identically to that
release.



1.4, released February 10, 2001:
----------------------------------

You can now optionally insert a compact version of the initial entry form
into the top of every downloaded page, by setting $INSERT_ENTRY_FORM=1.  The
form also displays the URL you're currently viewing.  This is selectable by
the user like the three original user-selectable options.  Frames are handled
correctly, i.e. it's not inserted in frames, because that would be really
ugly.  That was the hard part.

You can also insert your own block of HTML if desired: specify it either as a
fixed string in $INSERT_HTML, or name a file to be inserted in $INSERT_FILE.

For consistency, the URL format was changed slightly-- the initial flags in
PATH_INFO are always there and are now five in number.  So any bookmarks
saved through the proxy will have to be converted or recreated.  If there's
enough demand, I can write a simple converter each time I change the URL
format.

The user-entered hostname is now always lowercased, since host names are
case-insensitive.

Fixed a minor bug in 1.3.2 having to do with PATH_INFO encoding.



1.3.2, released February 3, 2001:
---------------------------------

By popular demand, you can now restrict which servers the proxy can access,
like the online demo does.  This is configured with the lists
@ALLOWED_SERVERS and @BANNED_SERVERS.

For FTP transfers, a "Content-Length:" header is now returned when
guessable.  This lets some browsers show you the percentage progress.

Pseudo-headers created by <meta http-equiv> tags are now handled like real
HTTP headers.  Internally, the handling of HTTP headers has been cleaned up.

If you absolutely, truly, can't run NPH scripts on your server, there is
now an option to run as best as possible as a normal non-NPH CGI script.
For this to work, your server MUST support the "Status:" CGI response
header.  All servers are supposed to support it, but not all do.
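
To make the difference concrete: an NPH script writes the HTTP status line
itself, while a normal CGI script emits a "Status:" header and leaves the
real status line to the server.  The sketch below shows just that first
line.  The response_start() helper is hypothetical, and the HTTP/1.0
version string is an assumption, not something taken from the script.

```perl
use strict;
use warnings;

# Build the first line of a CGI response.
# $nph true:  the script writes the HTTP status line itself (NPH).
# $nph false: the script emits a "Status:" CGI header and the server
#             builds the real status line from it.
sub response_start {
    my ($nph, $code, $reason) = @_;
    return $nph ? "HTTP/1.0 $code $reason\r\n"
                : "Status: $code $reason\r\n";
}

print response_start(1, 404, 'Not Found');   # NPH form
print response_start(0, 404, 'Not Found');   # non-NPH form
```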

There is now a $NO_BROWSE_THROUGH_SELF option which prevents the proxy from
calling itself, which is usually a mistake anyway.

Proxy authentication (the "Proxy-Authorization:" request header) is now
supported in a limited way.

Regexes have been improved to match tag attributes better, and a related
privacy hole was fixed.

URLs with spaces (which are a bad idea anyway) are now more likely to be
handled as expected.



1.3.1, released June 6, 2000:
-----------------------------

Script now runs correctly under mod_perl (requires at least Perl 5.004).

Script now runs correctly on an SSL server, if $RUNNING_ON_SSL_SERVER is set.

Main URL-conversion loop runs almost twice as fast (40% less CPU time), with
a fix I should have noticed a long time ago.

Login for HTTP Basic authentication is now submitted with POST instead of
GET, for better security.

Fixed privacy hole when servers didn't return Content-Type: header.



1.3, released April 8, 2000:
-----------------------------

Anonymity has been improved, especially regarding JavaScript or other
script content.  Before it was an afterthought; now it's being implemented
as completely as possible.  If you know any anonymity holes, please tell
me.  I'm especially interested in knowing any MIME types that identify
scripts.

In particular:

. Much more JavaScript is filtered out than before.  As far as I know, all
  of it is removed:  in <script> blocks, in style sheets, in HTML attributes,
  wherever indicated by HTTP headers, and other places.

. By default, $REMOVE_SCRIPTS is set to true.

. A potential privacy hole from a bug in Internet Explorer is protected.

You can now select which servers to allow scripts from, by setting
@ALLOWED_SCRIPT_SERVERS and @BANNED_SCRIPT_SERVERS.


If several people share a proxy, they can customize their own settings if
you set $ALLOW_USER_CONFIG.

Large files and streaming media are now supported, by transmitting the
data from the server as it arrives, rather than receiving the whole
resource before sending it to the client.  This works for both HTTP and
FTP.
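
The general idea can be sketched as a chunked copy loop.  This is an
illustration only, not the script's own code: relay() and the 8192-byte
chunk size are made up, and in-memory filehandles stand in for the server
and client sockets.

```perl
use strict;
use warnings;

# Copy data from $in to $out one chunk at a time, so that memory use
# stays bounded no matter how large the resource is.
sub relay {
    my ($in, $out, $chunk_size) = @_;
    $chunk_size ||= 8192;
    my $total = 0;
    my $buf;
    while (1) {
        my $n = read($in, $buf, $chunk_size);
        die "read failed: $!" unless defined $n;
        last if $n == 0;                          # end of input
        print {$out} $buf or die "write failed: $!";
        $total += $n;
    }
    return $total;
}

# Demonstrate with in-memory filehandles standing in for sockets.
my $source = 'x' x 20000;
my $sink   = '';
open(my $in,  '<', \$source) or die "cannot open source: $!";
open(my $out, '>', \$sink)   or die "cannot open sink: $!";
my $copied = relay($in, $out);
close($out);
print "copied $copied bytes\n";
```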

HTTP Basic authentication is now supported.  (I sure hope people use it,
because it's the most elaborate and convoluted hack of the whole program.)

There's an experimental load-balancing feature.  If you set @PROXY_GROUP
to a set of URLs of cooperating proxies, they'll randomly distribute the
load among them.  This may help or hinder privacy, and it may have other
uses too.  Let me know if you find those uses.
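
A random choice among a group of proxies can be sketched as follows.  The
URLs and the pick_proxy() helper are hypothetical, and how CGIProxy itself
decides when to hand a request to another group member is not shown here.

```perl
use strict;
use warnings;

# Hypothetical group of cooperating proxies.
my @PROXY_GROUP = (
    'http://proxy1.example.com/cgi-bin/nph-proxy.cgi',
    'http://proxy2.example.com/cgi-bin/nph-proxy.cgi',
    'http://proxy3.example.com/cgi-bin/nph-proxy.cgi',
);

# Pick one member uniformly at random.
sub pick_proxy {
    my @group = @_;
    return $group[ int rand @group ];
}

print pick_proxy(@PROXY_GROUP), "\n";
```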

For those who like to mess with the code, there are some neat new internal
mechanisms.  Cookies have now been extended to handle multiple tasks (had
to for Basic authentication), and there's a new internally-handled URL
scheme "x-proxy" that lets you plug in whatever magic functionality you
want (had to for Basic authentication).

There is no longer a startproxy.cgi.  It was swallowed by the main script.
It was becoming a vanishingly small percentage of the overall code.

More HTML tags are transformed, including the non-standard tags people
reported to me (thanks!).

A couple of non-standard HTTP headers with URLs are now transformed
correctly.

If you're running on Windows, you can now set a configuration flag, and
CGIProxy will work around a couple problems on that platform.

$SUPPORT_COOKIES has been reversed, and renamed to $REMOVE_COOKIES.  This
makes it more analogous to $REMOVE_SCRIPTS, and each now has its own
@ALLOWED... and @BANNED... server lists.

@BANNED_COOKIE_SERVERS and $NO_COOKIE_WITH_IMAGE have changed slightly--
they now take effect even when $FILTER_ADS isn't set.  They're more
associated with the $REMOVE_COOKIES flag now.

The initial URL-entry form may now submit using POST instead of GET,
based on the setting of $USE_POST_ON_START.  This is because some
filters apparently search outgoing URIs, but not POST request bodies.

FTP now follows symbolic links correctly, and another FTP bug or two were
fixed.



1.2, released September 11, 1999:
---------------------------------

The internal structure was rearranged in a big way, to support multiple
protocols more cleanly.  Previously, HTTP was ingrained throughout; now
it's more modular.

FTP is now supported.

@ALLOWED_COOKIE_SERVERS lets you only accept cookies from certain servers.

@BANNED_COOKIE_SERVERS and @ALLOWED_COOKIE_SERVERS are now lists of Perl
patterns (regular expressions) to match, rather than literal host names.
This lets you allow or forbid whole sets of servers rather than listing
each server individually.  For more information on Perl patterns, read the
Perl documentation.  nph-proxy.cgi has a note in the user config section
that may help enough.
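
As a small illustration of pattern lists, the sketch below checks host
names against a list of patterns.  The host names and the
host_matches_any() helper are made up, and the case-insensitive match is
an assumption; the script may apply its patterns differently.

```perl
use strict;
use warnings;

# Hypothetical patterns: one covers a whole domain, one a numbered family.
my @banned = ('\.ads\.example\.com$', '^tracker\d*\.example\.net$');

# True if $host matches at least one pattern in the list.
sub host_matches_any {
    my ($host, @patterns) = @_;
    for my $pattern (@patterns) {
        return 1 if $host =~ /$pattern/i;
    }
    return 0;
}

for my $host ('img.ads.example.com', 'tracker2.example.net', 'www.example.org') {
    printf "%-22s %s\n", $host, host_matches_any($host, @banned) ? 'banned' : 'ok';
}
```

One pattern such as '\.ads\.example\.com$' covers every host under that
domain, which is the advantage over listing literal host names.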

You can remove scripts from HTML pages by setting $REMOVE_SCRIPTS=1.  This
helps with anonymity somewhat by removing some JavaScript (but not all!).
It also removes most popup ads.  :)

The HEAD method is now supported more cleanly.

Rare net_path form of relative URL (i.e. like "//host.com/path/etc") is
now supported, for completeness and safety.
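
A net_path reference keeps the scheme of the page it appears in and
replaces everything after it.  The sketch below shows only that case; the
resolve_net_path() helper is hypothetical and is not the script's own
URL-resolution code, and every other kind of reference is returned
unchanged.

```perl
use strict;
use warnings;

# Resolve a "//host/path" reference against the base URL's scheme.
# Any other kind of reference is returned as-is in this sketch.
sub resolve_net_path {
    my ($base_url, $ref) = @_;
    return $ref unless $ref =~ m{^//};
    my ($scheme) = $base_url =~ m{^([A-Za-z][A-Za-z0-9+.\-]*):}
        or die "base URL has no scheme: $base_url";
    return "$scheme:$ref";
}

print resolve_net_path('http://www.example.com/dir/page.html',
                       '//host.com/path/etc'), "\n";
```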

The default lists of cookie and ad servers are a bit better.



1.1, released March 9, 1999:
----------------------------

The whole format of the target URL in PATH_INFO was restructured.  It can
be encoded however the user wishes.  This gets around PATH_INFO clashes in
various servers, solving most problems regarding server incompatibilities
I've heard about.

Cookies are now optionally supported (but off by default).

Banner ads can be filtered out.  Only a simple set of URL patterns is
filtered out by default, but it's easy to add more entries to
@BANNED_IMAGE_URL_PATTERNS.

Cookies from ad servers are filtered out (at least the main ones).  Again,
the default list in @BANNED_COOKIE_SERVERS is simple, but you can easily
add more.

Binary files are no longer getting messed up on Windows.

More HTTP headers are fixed to point back through the proxy.

Under some conditions in 1.0, extra processes would hang around for hours
and drag the system.  Alex Freed added a timeout to solve this for now.  I
can't reproduce the problem, so any info is appreciated.  [9-9-1999: It
may be a bug in older Apaches, fixed by upgrading to Apache 1.3.6 or
better.  Julian Haight reports the same problem with other scripts on
Apache 1.3.3, but not with Apache 1.3.6.]

Internally: code was cleaned up, URL-parsing was improved, and relative
URL calculation was redone.



1.0, released August 3, 1998:
-----------------------------

Initial release.


========================================================================

Last Modified: March 8, 2001
http://www.jmarshall.com/tools/cgiproxy/

