Differences

Below are the differences between two revisions of the page.

--- httrack [06/02/2012, 17:20]
steph138 [Usage]
+++ httrack [27/01/2024, 10:13] (Current version)
bruno old revision (06/11/2021, 17:03) restored
@@ Line 1: / Line 1: @@
-{{tag>Hardy Karmic Lucid Maverick internet développement}}
+{{tag>Bionic internet programmation BROUILLON}}
 ----
@@ Line 5: / Line 5: @@
 ====== Mirroring websites with httrack ======
-httrack is a well-known website copier.
+**Httrack** is a well-known website copier.
-=== Warning ===
+<note warning>
-The Ubuntu-FR forum and documentation, like any large site, **must not** be mirrored automatically, or the site may block your IP address. Mirroring sites calls for a certain ethic and should be used only when there is a real need to access specific content offline. Mirroring consumes a good deal of the hardware resources of the site you are downloading. Ask the webmaster for permission before proceeding! Keep in mind too that intellectual property always applies.
+Large sites (including the Ubuntu-fr forum and documentation) **must not** be mirrored automatically, or the site may block your IP address. Mirroring calls for a certain ethic and should be used only when there is a real need to access content offline. Mirroring demands far more hardware resources from the target site than simply displaying a web page. Ask the webmaster for permission before acting! Do not forget the issues related to intellectual property either.</note>
 ===== Installation =====
-There are 2 versions of httrack:
+There are two versions of httrack:
-  * The basic version: [[:tutoriel:comment_installer_un_paquet|install the package]] **[[apt://httrack|httrack]]** (Universe repository).
+  * The basic version: [[:tutoriel:comment_installer_un_paquet|install the package]] **[[apt>httrack]]**
-  * The graphical version, which uses your preferred browser: [[:tutoriel:comment_installer_un_paquet|install the package]] **[[apt://webhttrack|webhttrack]]** (Universe repository).
+  * The graphical version, which uses your preferred browser: [[:tutoriel:comment_installer_un_paquet|install the package]] **[[apt>webhttrack]]**.
+=====Usage=====
+httrack --mirror http://website.com
+httrack(1)                                                           General Commands Manual                                                          httrack(1)
-===== Usage =====
-Your browser must be closed before launching webhttrack.
-Here we focus on the graphical version, available in the menu Applications => Internet => « WebHTTrack Website Copier »
-<note important>
+NAME
-The shortcut created in the Applications menu runs the command "webhttrack browse". It opens the index of sites that have //already// been saved. For a first run, use the command <code>webhttrack</code>
+       httrack - offline browser : copy websites to a local directory
-</note>
-Your browser then opens a new tab.
+SYNOPSIS
+       httrack  [  url  ]...  [ -filter ]... [ +filter ]... [ -O, --path ] [ -w, --mirror ] [ -W, --mirror-wizard ] [ -g, --get-files ] [ -i, --continue ] [ -Y,
+       --mirrorlinks ] [ -P, --proxy ] [ -%f, --httpproxy-ftp[=N] ] [ -%b, --bind ] [ -rN, --depth[=N] ] [ -%eN, --ext-depth[=N] ] [ -mN,  --max-files[=N]  ]  [
+       -MN, --max-size[=N] ] [ -EN, --max-time[=N] ] [ -AN, --max-rate[=N] ] [ -%cN, --connection-per-second[=N] ] [ -GN, --max-pause[=N] ] [ -cN, --sockets[=N]
+       ] [ -TN, --timeout[=N] ] [ -RN, --retries[=N] ] [ -JN, --min-rate[=N] ] [ -HN, --host-control[=N] ] [ -%P, --extended-parsing[=N] ] [ -n, --near ] [  -t,
+       --test  ] [ -%L, --list ] [ -%S, --urllist ] [ -NN, --structure[=N] ] [ -%D, --cached-delayed-type-check ] [ -%M, --mime-html ] [ -LN, --long-names[=N] ]
+       [ -KN, --keep-links[=N] ] [ -x, --replace-external ] [ -%x, --disable-passwords ] [ -%q,  --include-query-string  ]  [  -o,  --generate-errors  ]  [  -X,
+       --purge-old[=N]  ]  [  -%p, --preserve ] [ -%T, --utf8-conversion ] [ -bN, --cookies[=N] ] [ -u, --check-type[=N] ] [ -j, --parse-java[=N] ] [ -sN, --robots[=N]
+       ] [ -%h, --http-10 ] [ -%k, --keep-alive ] [ -%B, --tolerant ] [ -%s, --updatehack ] [ -%u, --urlhack ] [ -%A, --assume ] [ -@iN, --protocol[=N]
+       ]  [  -%w,  --disable-module  ]  [  -F,  --user-agent ] [ -%R, --referer ] [ -%E, --from ] [ -%F, --footer ] [ -%l, --language ] [ -%a, --accept ] [ -%X,
+       --headers ] [ -C, --cache[=N] ] [ -k, --store-all-in-cache ] [ -%n, --do-not-recatch ] [ -%v, --display ] [ -Q, --do-not-log ] [  -q,  --quiet  ]  [  -z,
+       --extra-log  ]  [  -Z,  --debug-log  ]  [  -v,  --verbose  ]  [  -f, --file-log ] [ -f2, --single-log ] [ -I, --index ] [ -%i, --build-top-index ] [ -%I,
+       --search-index ] [ -pN, --priority[=N] ] [ -S, --stay-on-same-dir ] [ -D, --can-go-down ] [  -U,  --can-go-up  ]  [  -B,  --can-go-up-and-down  ]  [  -a,
+       --stay-on-same-address ] [ -d, --stay-on-same-domain ] [ -l, --stay-on-same-tld ] [ -e, --go-everywhere ] [ -%H, --debug-headers ] [ -%!, --disable-security-limits
+       ] [ -V, --userdef-cmd ] [ -%W, --callback ] [ -K, --keep-links[=N] ] [
+DESCRIPTION
+       httrack allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML,  images,
+       and other files from the server to your computer. HTTrack arranges the original site's relative link-structure. Simply open a page of the "mirrored" website
+       in your browser, and you can browse the site from link to link, as if you were viewing it online. HTTrack can also update an existing mirrored site,
+       and resume interrupted downloads.
+EXAMPLES
+       httrack www.someweb.com/bob/
+               mirror site www.someweb.com/bob/ and only this site
+       httrack www.someweb.com/bob/ www.anothertest.com/mike/ +*.com/*.jpg -mime:application/*
+               mirror the two sites together (with shared links) and accept any .jpg files on .com sites
+       httrack www.someweb.com/bob/bobby.html +* -r6
+              means get all files starting from bobby.html, with 6 link-depth, and possibility of going everywhere on the web
+       httrack www.someweb.com/bob/bobby.html --spider -P proxy.myhost.com:8080
+              runs the spider on www.someweb.com/bob/bobby.html using a proxy
+       httrack --update
+              updates a mirror in the current folder
+       httrack
+              will bring you to the interactive mode
+       httrack --continue
+              continues a mirror in the current folder
+OPTIONS
+   General options:
+       -O     path for mirror/logfiles+cache (-O path mirror[,path cache and logfiles]) (--path <param>)
+   Action options:
+       -w     *mirror web sites (--mirror)
+       -W     mirror web sites, semi-automatic (asks questions) (--mirror-wizard)
+       -g     just get files (saved in the current directory) (--get-files)
+       -i     continue an interrupted mirror using the cache (--continue)
+       -Y     mirror ALL links located in the first level pages (mirror links) (--mirrorlinks)
+   Proxy options:
+       -P     proxy use (-P proxy:port or -P user:pass@proxy:port) (--proxy <param>)
+       -%f    *use proxy for ftp (f0 don't use) (--httpproxy-ftp[=N])
+       -%b    use this local hostname to make/send requests (-%b hostname) (--bind <param>)
+   Limits options:
+       -rN    set the mirror depth to N (* r9999) (--depth[=N])
+       -%eN   set the external links depth to N (* %e0) (--ext-depth[=N])
+       -mN    maximum file length for a non-html file (--max-files[=N])
+       -mN,N2 maximum file length for non html (N) and html (N2)
+       -MN    maximum overall size that can be uploaded/scanned (--max-size[=N])
+       -EN    maximum mirror time in seconds (60=1 minute, 3600=1 hour) (--max-time[=N])
+       -AN    maximum transfer rate in bytes/seconds (1000=1KB/s max) (--max-rate[=N])
+       -%cN   maximum number of connections/seconds (*%c10) (--connection-per-second[=N])
+       -GN    pause transfer if N bytes reached, and wait until lock file is deleted (--max-pause[=N])
+   Flow control:
+       -cN    number of multiple connections (*c8) (--sockets[=N])
+       -TN    timeout, number of seconds after a non-responding link is shutdown (--timeout[=N])
+       -RN    number of retries, in case of timeout or non-fatal errors (*R1) (--retries[=N])
+       -JN    traffic jam control, minimum transfer rate (bytes/seconds) tolerated for a link (--min-rate[=N])
+       -HN    host is abandoned if: 0=never, 1=timeout, 2=slow, 3=timeout or slow (--host-control[=N])
+   Links options:
+       -%P    *extended parsing, attempt to parse all links, even in unknown tags or Javascript (%P0 don't use) (--extended-parsing[=N])
+       -n     get non-html files 'near' an html file (ex: an image located outside) (--near)
+       -t     test all URLs (even forbidden ones) (--test)
+       -%L    <file> add all URL located in this text file (one URL per line) (--list <param>)
+       -%S    <file> add all scan rules located in this text file (one scan rule per line) (--urllist <param>)
+   Build options:
+       -NN    structure type (0 *original structure, 1+: see below) (--structure[=N])
+       -or    user defined structure (-N "%h%p/%n%q.%t")
+       -%N    delayed type check, don't make any link test but wait for files download to start instead (experimental) (%N0 don't use, %N1 use for unknown
+              extensions, * %N2 always use)
+       -%D    cached delayed type check, don't wait for remote type during updates, to speedup them (%D0 wait, * %D1 don't wait) (--cached-delayed-type-check)
+       -%M    generate a RFC MIME-encapsulated full-archive (.mht) (--mime-html)
+       -LN    long names (L1 *long names / L0 8-3 conversion / L2 ISO9660 compatible) (--long-names[=N])
+       -KN    keep original links (e.g. http://www.adr/link) (K0 *relative link, K absolute links, K4 original links, K3  absolute  URI  links,  K5  transparent
+              proxy link) (--keep-links[=N])
+       -x     replace external html links by error pages (--replace-external)
+       -%x    do not include any password for external password protected websites (%x0 include) (--disable-passwords)
+       -%q    *include query string for local files (useless, for information purpose only) (%q0 don't include) (--include-query-string)
+       -o     *generate output html file in case of error (404..) (o0 don't generate) (--generate-errors)
+       -X     *purge old files after update (X0 keep delete) (--purge-old[=N])
+       -%p    preserve html files 'as is' (identical to '-K4 -%F ""') (--preserve)
+       -%T    links conversion to UTF-8 (--utf8-conversion)
+   Spider options:
+       -bN    accept cookies in cookies.txt (0=do not accept,* 1=accept) (--cookies[=N])
+       -u     check document type if unknown (cgi,asp..) (u0 don't check, * u1 check but /, u2 check always) (--check-type[=N])
+       -j     *parse Java Classes (j0 don't parse, bitmask: |1 parse default, |2 don't parse .class |4 don't parse .js |8 don't be aggressive)
+              (--parse-java[=N])
+       -sN    follow robots.txt and meta robots tags (0=never,1=sometimes,* 2=always, 3=always (even strict rules)) (--robots[=N])
+       -%h    force HTTP/1.0 requests (reduce update features, only for old servers or proxies) (--http-10)
+       -%k    use keep-alive if possible, greatly reducing latency for small files and test requests (%k0 don't use) (--keep-alive)
+       -%B    tolerant requests (accept bogus responses on some servers, but not standard!) (--tolerant)
+       -%s    update hacks: various hacks to limit re-transfers when updating (identical size, bogus response..) (--updatehack)
+       -%u    url hacks: various hacks to limit duplicate URLs (strip , www.foo.com==foo.com..) (--urlhack)
+       -%A    assume that a type (cgi,asp..) is always linked with a mime type (-%A php3,cgi=text/html;dat,bin=application/x-zip) (--assume <param>)
+       -can   also be used to force a specific file type: --assume foo.cgi=text/html
+       -@iN   internet protocol (0=both ipv6+ipv4, 4=ipv4 only, 6=ipv6 only) (--protocol[=N])
+       -%w    disable a specific external mime module (-%w htsswf -%w htsjava) (--disable-module <param>)
+   Browser ID:
+       -F     user-agent field sent in HTTP headers (-F "user-agent name") (--user-agent <param>)
+       -%R    default referer field sent in HTTP headers (--referer <param>)
+       -%E    from email address sent in HTTP headers (--from <param>)
+       -%F    footer string in Html code (-%F "Mirrored [from host %s [file %s [at %s]]]" (--footer <param>)
+       -%l    preferred language (-%l "fr, en, jp, *" (--language <param>)
+       -%a    accepted formats (-%a "text/html,image/png;q=0.9,*/*;q=0.1" (--accept <param>)
+       -%X    additional HTTP header line (-%X "X-Magic: 42" (--headers <param>)
+   Log, index, cache
+       -C     create/use a cache for updates and retries (C0 no cache,C1 cache is prioritary,* C2 test update before) (--cache[=N])
+       -k     store all files in cache (not useful if files on disk) (--store-all-in-cache)
+       -%n    do not re-download locally erased files (--do-not-recatch)
+       -%v    display on screen filenames downloaded (in realtime) - * %v1 short version - %v2 full animation (--display)
+       -Q     no log - quiet mode (--do-not-log)
+       -q     no questions - quiet mode (--quiet)
+       -z     log - extra infos (--extra-log)
+       -Z     log - debug (--debug-log)
+       -v     log on screen (--verbose)
+       -f     *log in files (--file-log)
+       -f2    one single log file (--single-log)
+       -I     *make an index (I0 don't make) (--index)
+       -%i    make a top index for a project folder (* %i0 don't make) (--build-top-index)
+       -%I    make a searchable index for this mirror (* %I0 don't make) (--search-index)
+   Expert options:
+       -pN    priority mode: (* p3) (--priority[=N])
+       -p0    just scan, don't save anything (for checking links)
+       -p1    save only html files
+       -p2    save only non html files
+       -*p3   save all files
+       -p7    get html files before, then treat other files
+       -S     stay on the same directory (--stay-on-same-dir)
+       -D     *can only go down into subdirs (--can-go-down)
+       -U     can only go to upper directories (--can-go-up)
+       -B     can both go up&down into the directory structure (--can-go-up-and-down)
+       -a     *stay on the same address (--stay-on-same-address)
+       -d     stay on the same principal domain (--stay-on-same-domain)
+       -l     stay on the same TLD (eg: .com) (--stay-on-same-tld)
+       -e     go everywhere on the web (--go-everywhere)
+       -%H    debug HTTP headers in logfile (--debug-headers)
+   Guru options: (do NOT use if possible)
+       -#X    *use optimized engine (limited memory boundary checks) (--fast-engine)
+       -#0    filter test (-#0  *.gif   www.bar.com/foo.gif ) (--debug-testfilters <param>)
+       -#1    simplify test (-#1 ./foo/bar/../foobar)
+       -#2    type test (-#2 /foo/bar.php)
+       -#C    cache list (-#C  *.com/spider*.gif  (--debug-cache <param>)
+       -#R    cache repair (damaged cache) (--repair-cache)
+       -#d    debug parser (--debug-parsing)
+       -#E    extract new.zip cache meta-data in meta.zip
+       -#f    always flush log files (--advanced-flushlogs)
+       -#FN   maximum number of filters (--advanced-maxfilters[=N])
+       -#h    version info (--version)
+       -#K    scan stdin (debug) (--debug-scanstdin)
+       -#L    maximum number of links (-#L1000000) (--advanced-maxlinks)
+       -#p    display ugly progress information (--advanced-progressinfo)
+       -#P    catch URL (--catch-url)
+       -#R    old FTP routines (debug) (--repair-cache)
+       -#T    generate transfer ops. log every minutes (--debug-xfrstats)
+       -#u    wait time (--advanced-wait)
+       -#Z    generate transfer rate statistics every minutes (--debug-ratestats)
+   Dangerous options: (do NOT use unless you exactly know what you are doing)
+       -%!    bypass built-in security limits aimed to avoid bandwidth abuses (bandwidth, simultaneous connections) (--disable-security-limits)
+       -IMPORTANT
+              NOTE: DANGEROUS OPTION, ONLY SUITABLE FOR EXPERTS
+       -USE   IT WITH EXTREME CARE
+   Command-line specific options:
+       -V     execute system command after each files ($0 is the filename: -V "rm \$0") (--userdef-cmd <param>)
+       -%W    use an external library function as a wrapper (-%W myfoo.so[,myparameters]) (--callback <param>)
+   Details: Option N
+       -N0    Site-structure (default)
+       -N1    HTML in web/, images/other files in web/images/
+       -N2    HTML in web/HTML, images/other in web/images
+       -N3    HTML in web/,  images/other in web/
+       -N4    HTML in web/, images/other in web/xxx, where xxx is the file extension (all gif will be placed onto web/gif, for example)
+       -N5    Images/other in web/xxx and HTML in web/HTML
+       -N99   All files in web/, with random names (gadget !)
+       -N100  Site-structure, without www.domain.xxx/
+       -N101  Identical to N1 except that "web" is replaced by the site's name
+       -N102  Identical to N2 except that "web" is replaced by the site's name
+       -N103  Identical to N3 except that "web" is replaced by the site's name
+       -N104  Identical to N4 except that "web" is replaced by the site's name
+       -N105  Identical to N5 except that "web" is replaced by the site's name
+       -N199  Identical to N99 except that "web" is replaced by the site's name
+       -N1001 Identical to N1 except that there is no "web" directory
+       -N1002 Identical to N2 except that there is no "web" directory
+       -N1003 Identical to N3 except that there is no "web" directory (option set for g option)
+       -N1004 Identical to N4 except that there is no "web" directory
+       -N1005 Identical to N5 except that there is no "web" directory
+       -N1099 Identical to N99 except that there is no "web" directory
+   Details: User-defined option N
+          %n  Name of file without file type (ex: image)
+          %N  Name of file, including file type (ex: image.gif)
+          %t  File type (ex: gif)
+          %p  Path [without ending /] (ex: /someimages)
+          %h  Host name (ex: www.someweb.com)
+          %M  URL MD5 (128 bits, 32 ascii bytes)
+          %Q  query string MD5 (128 bits, 32 ascii bytes)
+          %k  full query string
+          %r  protocol name (ex: http)
+          %q  small query string MD5 (16 bits, 4 ascii bytes)
+             %s?  Short name version (ex: %sN)
+          %[param]  param variable in query string
+          %[param:before:after:empty:notfound]  advanced variable extraction
+   Details: User-defined option N and advanced variable extraction
+          %[param:before:after:empty:notfound]
+       -param : parameter name
+       -before
+              : string to prepend if the parameter was found
+       -after : string to append if the parameter was found
+       -notfound
+              : string replacement if the parameter could not be found
+       -empty : string replacement if the parameter was empty
+       -all   fields, except the first one (the parameter name), can be empty
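To make the user-defined structure placeholders more concrete, here is a rough sketch of how a template such as `%h%p/%n.%t` could map a file to a local path. It is an illustration only, not httrack's actual implementation: the function name `expand_structure` is hypothetical, it handles just `%h`, `%p`, `%n` and `%t`, and the sample values are the ones given in the placeholder list above.

```shell
#!/bin/sh
# Hypothetical illustration of how a -N "<template>" string might expand.
# Only %h (host), %p (path), %n (name) and %t (type) are handled here.
# Values containing "|" or "&" would confuse sed; this is a sketch, not a parser.
expand_structure() {
    template=$1 host=$2 path=$3 name=$4 type=$5
    printf '%s\n' "$template" | sed \
        -e "s|%h|$host|g" \
        -e "s|%p|$path|g" \
        -e "s|%n|$name|g" \
        -e "s|%t|$type|g"
}

# prints www.someweb.com/someimages/image.gif
expand_structure '%h%p/%n.%t' www.someweb.com /someimages image gif
```

With the real tool, the template is passed directly, for example `-N "%h%p/%n%q.%t"` as shown in the build options.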
-  * Choose the language (Français).
-  * Click « Suivant » (Next).
-  * Choose the project name, the category and, above all, the folder.
-  * Choose « Copie Automatique de Site-web » (automatic website copy) and enter the site address in the box.
-  * Click « Suivant » (Next), then « Terminer » (Finish).
-  * Done!
-More advanced options are available. Feel free to experiment!
-For example, if your mirrored sites show defects in the downloaded images, the cause may be too many simultaneous connections. Reduce the number to 2 or 1.
 ===== Command-line usage =====
-Create a mirror:
+Create a mirror :
 <code>httrack --mirror http://www.monsite.com</code>
-Update the current project:
+Update the current project :
 <code>httrack --update</code>
-Clean the cache and log files:
+Clean the cache and log files :
 <code>httrack --clean</code>
 <code>httrack --clean</code>
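
The mirror command can be combined with the limit options listed in the manual excerpt so that a copy stays gentle on the target server, in the spirit of the warning at the top of the page. A minimal sketch, assuming a POSIX shell: the helper name `polite_mirror_cmd`, the output directory and the chosen limits are illustrative assumptions, and the function only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of httrack): prints a rate-limited mirror command.
# Flags are taken from the manual excerpt:
#   -O path    output directory for the mirror, logs and cache
#   -r3        mirror depth of 3
#   -c2        two simultaneous connections
#   -A25000    at most 25000 bytes per second
#   -s2        always follow robots.txt
polite_mirror_cmd() {
    url=$1
    dest=$2
    printf 'httrack --mirror %s -O %s -r3 -c2 -A25000 -s2\n' "$url" "$dest"
}

polite_mirror_cmd "http://www.monsite.com" "./miroir"
```

Review the printed command, adjust the limits to what the site owner agreed to, then run it by hand.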