Stateful programmatic web browsing in Python, after Andy Lester's Perl module WWW::Mechanize.
mechanize.Browser is a subclass of mechanize.UserAgent, which is, in turn, a subclass of urllib2.OpenerDirector (in fact, of mechanize.OpenerDirector), so:
any URL can be opened, not just http:
mechanize.UserAgent offers easy dynamic configuration of user-agent features like protocol, cookie, redirection and robots.txt handling, without having to make a new OpenerDirector each time, e.g. by calling build_opener().
Browser history (.back() and .reload() methods).
The Referer HTTP header is added properly (optional).
Automatic observance of robots.txt.
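For contrast, here is a minimal sketch of the plain standard-library approach that mechanize.UserAgent is meant to spare you: every change of policy means building another opener. It is an illustration only. It uses Python 3's urllib.request (the successor of urllib2) and http.cookiejar (the successor of cookielib) rather than mechanize itself, and the helper names make_opener and has_cookie_handler are made up for this example.

```python
import http.cookiejar
import urllib.request


def make_opener(handle_cookies=True):
    """Build a fresh OpenerDirector for the requested policy (hypothetical helper)."""
    handlers = []
    if handle_cookies:
        jar = http.cookiejar.CookieJar()
        handlers.append(urllib.request.HTTPCookieProcessor(jar))
    # build_opener() returns a new OpenerDirector on every call, so each
    # different policy ends up as a separate opener object.
    return urllib.request.build_opener(*handlers)


def has_cookie_handler(opener):
    """Report whether the opener carries a cookie-processing handler."""
    return any(
        isinstance(handler, urllib.request.HTTPCookieProcessor)
        for handler in opener.handlers
    )


with_cookies = make_opener(handle_cookies=True)
without_cookies = make_opener(handle_cookies=False)
```

With mechanize.UserAgent the equivalent switch is a method call on one long-lived object, which is the point of the class.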
This documentation is in need of reorganisation and extension!
The two examples below are just to give the gist. There are also some actual working examples.
import re
from mechanize import Browser

br = Browser()
br.open("http://www.example.com/")
# follow second link with element text matching regular expression
response1 = br.follow_link(text_regex=re.compile(r"cheese\s*shop"), nr=1)
assert br.viewing_html()
print br.title()
print response1.geturl()
print response1.info()  # headers
print response1.read()  # body
response1.close()  # (shown for clarity; in fact Browser does this for you)

br.select_form(name="order")
# Browser passes through unknown attributes (including methods)
# to the selected HTMLForm (from ClientForm).
br["cheeses"] = ["mozzarella", "caerphilly"]  # (the method here is __setitem__)
response2 = br.submit()  # submit current form

response3 = br.back()  # back to cheese shop (same data as response1)
# the history mechanism returns cached response objects
# we can still use the response, even though we closed it:
response3.seek(0)
response3.read()
response4 = br.reload()  # fetches from server

for form in br.forms():
    print form
# .links() optionally accepts the keyword args of .follow_/.find_link()
for link in br.links(url_regex=re.compile("python.org")):
    print link
    br.follow_link(link)  # takes EITHER Link instance OR keyword args
    br.back()
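The nr=1 argument in the example above picks the second matching link, so counting starts at zero. The sketch below illustrates that selection rule on plain strings. It is not mechanize's implementation: pick_link_text is a hypothetical helper, and it assumes the regular expression is searched for anywhere in the link text rather than anchored at the start.

```python
import re


def pick_link_text(link_texts, text_regex, nr=0):
    """Return the nr-th (0-based) text in which the regex is found."""
    matches = [text for text in link_texts if text_regex.search(text)]
    if nr >= len(matches):
        raise LookupError("fewer than %d matching links" % (nr + 1))
    return matches[nr]


# Made-up link texts standing in for the links on a page.
texts = ["Home", "The cheese shop", "Contact", "cheeseshop mirror"]
pattern = re.compile(r"cheese\s*shop")

first = pick_link_text(texts, pattern)         # nr defaults to 0
second = pick_link_text(texts, pattern, nr=1)  # the "second link" of the example
```

Because \s* also matches the empty string, both "cheese shop" and "cheeseshop" count as matches here.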
You may control the browser's policy by using the methods of mechanize.Browser's base class, mechanize.UserAgent. For example:
import logging
import sys

import mechanize
from mechanize import Browser

br = Browser()
# Don't handle HTTP-EQUIV headers (HTTP headers embedded in HTML).
br.set_handle_equiv(False)
# Ignore robots.txt.  Do not do this without thought and consideration.
br.set_handle_robots(False)
# Don't handle cookies
br.set_cookiejar()
# Supply your own mechanize.CookieJar (NOTE: cookie handling is ON by
# default: no need to do this unless you have some reason to use a
# particular cookiejar)
cj = mechanize.CookieJar()
br.set_cookiejar(cj)
# Log information about HTTP redirects and Refreshes.
br.set_debug_redirects(True)
# Log HTTP response bodies (ie. the HTML, most of the time).
br.set_debug_responses(True)
# Print HTTP headers.
br.set_debug_http(True)

# To make sure you're seeing all debug output:
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)
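The logging lines at the end of that example are ordinary standard-library logging, so they can be tried without mechanize installed. A small sketch follows, written for Python 3: the helper name enable_debug_log is hypothetical, and the logged message is a stand-in for whatever the library would actually emit.

```python
import io
import logging


def enable_debug_log(stream, name="mechanize", level=logging.INFO):
    """Attach a stream handler to the named logger and return the logger."""
    logger = logging.getLogger(name)
    logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(level)
    return logger


buffer = io.StringIO()
log = enable_debug_log(buffer)
log.info("redirecting to %s", "http://www.example.com/")  # stand-in message
log.debug("dropped: below the INFO threshold")
captured = buffer.getvalue()
```

Passing sys.stdout instead of the in-memory buffer gives the behaviour of the original three lines. Messages below the chosen level are discarded, so lower the level if the output seems incomplete.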
This note explains the relationship between mechanize, ClientCookie, cookielib and urllib2, and which to use when.
Old ClientCookie code can be changed to use import mechanize as ClientCookie and should continue to work.
The cookie-handling parts of mechanize are in the Python 2.4 standard library as module cookielib and extensions to module urllib2.
mechanize handler classes that are missing from 2.4's urllib2 (e.g. HTTPRefreshProcessor) may be used with 2.4's urllib2 (however, note the paragraph below). With any version of Python, urllib2 handlers that are missing from mechanize (e.g. ProxyHandler) may be used with mechanize, and urllib2.Request objects may be used with mechanize.
IMPORTANT: For all other code, use mechanize exclusively: do NOT mix use of mechanize and urllib2!
To use mechanize.RefreshProcessor with Python >= 2.4's urllib2, you must also use mechanize.HTTPRedirectHandler.
mechanize.HTTPRefererProcessor requires special support from mechanize.Browser, so cannot be used with vanilla urllib2.
mechanize.HTTPRequestUpgradeProcessor and mechanize.ResponseUpgradeProcessor are not useful outside of mechanize.
Full documentation is in the docstrings.
The documentation in the web pages is in need of reorganisation at the moment, after the merge of ClientCookie into mechanize.
Thanks to all the too-numerous-to-list people who reported bugs and provided patches. Also thanks to Ian Bicking, for persuading me that a UserAgent class would be useful, and to Ronald Tschalar for advice on Netscape cookies.
A lot of credit must go to Gisle Aas, who wrote libwww-perl, from which large parts of mechanize originally derived, and Andy Lester for the original, WWW::Mechanize. Finally, thanks to the (coincidentally-named) Johnny Lee for the MSIE CookieJar Perl code from which mechanize's support for that is derived.
Contributions welcome!
Browser.form_as_string() and Browser.__str__() methods.
mechanize.UserAgent methods and then .add_handler() if you need to give it a specific handler instance to use for one of the things UserAgent already handles.
You can install the old-fashioned way, or using EasyInstall. I recommend the latter even though EasyInstall is still in alpha, because it will automatically ensure you have the necessary dependencies, downloading if necessary.
Subversion (SVN) access is also available.
Since EasyInstall is new, I include some instructions below, but mechanize follows standard EasyInstall / setuptools conventions, so you should refer to the EasyInstall and setuptools documentation if you need more detailed or up-to-date instructions.
The benefit of EasyInstall and the new setuptools-supporting setup.py is that they grab all dependencies for you.
You need EasyInstall version 0.6a8 or newer.
easy_install mechanize
If you're on a Unix-like OS, you may need root permissions for that last step (or see the EasyInstall documentation for other installation options).
If you already have mechanize installed as a Python Egg (as you do if you installed using EasyInstall, or using setup.py install from mechanize 0.0.10a or newer), you can upgrade to the latest version using:
easy_install --upgrade mechanize
You probably want to read up on the -m option to easy_install, which lets you install multiple versions of a package.
easy_install "mechanize==dev"
Note that this will not necessarily grab the SVN versions of dependencies, such as ClientForm. It will use SVN to fetch dependencies if and only if the SVN HEAD version of mechanize declares itself to depend on the SVN versions of those dependencies; even then, those declared dependencies won't necessarily be on SVN HEAD, but rather at a particular revision. If you want SVN HEAD for a dependency project, you should ask for it explicitly by running easy_install "projectname==dev" for that project.
Note also that you can still carry on using a plain old SVN checkout as usual if you like (optionally in conjunction with setup.py develop; this is particularly useful on Windows, since it functions rather like symlinks).
setup.py should correctly resolve and download dependencies:
python setup.py install
Or, to get access to the same options that easy_install accepts, use the easy_install distutils command instead of install (see python setup.py --help easy_install):
python setup.py easy_install mechanize
Note: this section is only useful for people who want to change mechanize: It is not useful to do this if all you want is to keep up with SVN.
For development of mechanize using EasyInstall (see the setuptools docs for details), you have the option of using the develop distutils command. This is particularly useful on Windows, since it functions rather like symlinks. Get the mechanize source, then:
python setup.py develop
Note that after every svn update on a develop-installed project, you should run setup.py develop to ensure that project's dependencies are updated if required.
Also note that, currently, if you also use the develop distutils command on the dependencies of mechanize to keep up with SVN, you must run setup.py develop for each dependency of mechanize before running it for mechanize itself. As a result, in this case it's probably simplest to just set up your sys.path manually rather than using setup.py develop.
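Setting up sys.path manually can be as simple as the sketch below. The helper prepend_checkouts and the directory names are hypothetical, and it assumes each SVN working copy has its importable package or module at the top level of the checkout.

```python
import os
import sys


def prepend_checkouts(checkout_dirs, path=None):
    """Put each checkout directory at the front of the import path, in the given order."""
    path = sys.path if path is None else path
    # Insert in reverse so that the first directory listed ends up first.
    for directory in reversed(checkout_dirs):
        directory = os.path.abspath(directory)
        if directory in path:
            path.remove(directory)
        path.insert(0, directory)
    return path


# Hypothetical checkout locations; a throwaway list is used here so that
# the demonstration leaves the real sys.path alone.
search_path = prepend_checkouts(["svn/ClientForm", "svn/mechanize"], path=[])
```

In real use, call prepend_checkouts([...]) without the path argument before importing mechanize, so that the working copies shadow any installed versions.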
One convenient way to get the latest source is:
easy_install --editable --build-directory mybuilddir "mechanize==dev"
All documentation (including this web page) is included in the distribution.
This is an alpha release: interfaces may change, and there will be bugs.
Development release.
For old-style installation instructions, see the INSTALL file included in the distribution. Better, use EasyInstall.
The Subversion (SVN) trunk is http://codespeak.net/svn/wwwsearch/mechanize/trunk, so to check out the source:
svn co http://codespeak.net/svn/wwwsearch/mechanize/trunk mechanize
The examples directory in the source packages contains a couple of silly, but working, scripts to demonstrate basic use of the module. Note that it's in the nature of web scraping for such scripts to break, so don't be too surprised if that happens; do let me know, though!
It's also worth knowing that the examples on the ClientForm web page are useful for mechanize users, and are now real runnable scripts rather than just documentation.
To run the functional tests (which do access the network), run the following command:
python functional_tests.py
Note that ClientForm (a dependency of mechanize) has its own unit tests, which must be run separately.
To run the unit tests (none of which access the network), run the following command:
python test.py
This runs the tests against the source files extracted from the package. For help on command line options:
python test.py --help
There are several wrappers around mechanize designed for functional testing of web applications:
zope.testbrowser (or ZopeTestBrowser, the standalone version).
Richard Jones' webunit (this is not the same as Steven Purcell's code of the same name). webunit and mechanize are quite similar. On the minus side, webunit is missing things like browser history, high-level forms and links handling, thorough cookie handling, refresh redirection, adding of the Referer header, observance of robots.txt and easy extensibility. On the plus side, webunit has a bunch of utility functions bound up in its WebFetcher class, which look useful for writing tests (though they'd be easy to duplicate using mechanize). In general, webunit has more of a frameworky emphasis, with aims limited to writing tests, where mechanize and the modules it depends on try hard to be general-purpose libraries.
There are many related links in the General FAQ page, too.
Python 2.3 or above is required.
mechanize depends on ClientForm.
The versions of those required modules are listed in the setup.py for mechanize (included with the download). The dependencies are automatically fetched by EasyInstall (or by downloading a mechanize source package and running python setup.py install). If you like, you can fetch and install them manually instead; see the INSTALL.txt file (included with the distribution).
mechanize is dual-licensed: you may pick either the BSD license, or the ZPL 2.1 (both are included in the distribution).
I'm sure this page is HTML, so why does mechanize.Browser think otherwise?
import mechanize

b = mechanize.Browser(
    # mechanize's XHTML support needs work, so is currently switched off.  If
    # we want to get our work done, we have to turn it on by supplying a
    # mechanize.Factory (with XHTML support turned on):
    factory=mechanize.DefaultFactory(i_want_broken_xhtml_support=True)
    )
I prefer questions and comments to be sent to the mailing list rather than direct to me.
John J. Lee, May 2006.