Discussion:
Export MediaWiki articles to static HTML
Jan Steinman
2005-10-27 17:47:31 UTC
I've spent quite some time searching for a way to export articles
from MediaWiki to a static HTML tree or something similar.
Have you tried wget? Should work under Linux. Sorry, I Don't Do
Windows(tm)!
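Something along these lines is usually a reasonable starting point (the
URL, the depth and the reject patterns are only examples and will need
adjusting for your wiki):

# mirror the wiki, rewrite links for offline use, skip edit/history pages
wget --recursive --level=2 \
     --convert-links --page-requisites --html-extension \
     --reject "*action=edit*,*action=history*,*oldid*" \
     --directory-prefix=./wiki-static \
     http://wiki.example.org/wiki/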

:::: Our enemies are innovative and resourceful, and so are we. They
never stop thinking about new ways to harm our country and our
people, and neither do we. -- George W. Bush
:::: Jan Steinman <http://www.Bytesmiths.com/Events>
Sebastian Albrecht
2005-10-27 22:00:44 UTC
Hello,
Post by Jan Steinman
Have you tried wget? Should work under Linux. Sorry, I Don't Do
Windows(tm)!
Me too ;)
This is what I've done so far and it's ok for me:

It is a sh script you can copy into a file called wiki2html. Make it
executable with chmod and run it.
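A run could look like this (the URL is only a placeholder; the third
argument is the optional recursion depth):

chmod +x wiki2html
./wiki2html http://wiki.example.org/wiki ./export 3

The pages then end up under ./export/wiki.example.org/wiki/.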

It will fetch HTML pages from the wiki using wget and will also try to
get a few extra files such as main.css and the logo. It will then use
sed to replace absolute paths (/wiki/skins/...) in the CSS and
JavaScript references of the downloaded HTML pages. This is what wget
won't do by itself, and it is what makes the whole thing look a little
better (than the printable format).

Please note that this script is quite specific to my personal wiki, so
you should have a look at it before using it yourself. The reject
patterns of the wget command can certainly be optimized.
DON'T try to use it on Wikipedia: it won't even dent their servers,
but it may well kill your client.

Best regards,
Sebastian



#!/bin/sh
######################################################
#
# WIKI Export script - Wgets a wiki to static html.
#
######################################################


# Check input

if [ "$2" = "" ] ; then

echo "
$0 - Wgets a wiki to static html, 10/2005

This script does a wget to retrieve static html pages from a wiki.
Several typical wiki pages are excluded because they are unimportant
for offline usage (edit, history and special pages).
URLs in the html pages are changed automatically so you can
browse the static wiki offline.

Usage:
$0 <URL_to_wiki> <destination_dir> [<recursive_depth> default=2]

Examples:
$0 http://url/wiki ./wiki
$0 http://url/wiki ./wiki 3

Requires:
sed, wget
"
exit 1
fi


# Define input variables

URL=$1
DEST_DIR=$2
# wget mirrors into <dest_dir>/<host>/<path>, so strip the protocol prefix to get the page directory
DEST_DIR_COMPLETE=$DEST_DIR/`echo "$URL" | sed 's/[a-zA-Z]*:\/\///g'`
REC_LEVEL=$3


# Default to a depth of 2 if no (or a non-positive) recursion depth was given
if [ -z "$REC_LEVEL" ] || [ "$REC_LEVEL" -le 0 ] 2>/dev/null ; then
    REC_LEVEL=2
fi


# WGET pages recursively

echo "
Getting wiki pages to static html...
URL: $URL
Destination: $DEST_DIR
"

# Note: 'Spezial' is the German Special: namespace - adjust the reject list for your wiki's language
wget \
-nv \
--convert-links \
--page-requisites \
--html-extension \
--recursive \
--level="$REC_LEVEL" \
--directory-prefix="$DEST_DIR" \
--reject "*edit*,*history*,*Spezial*,*oldid*" \
"$URL"


# Get main.css for having a nicer static wiki

echo "
Trying to get some files for more beauty (main.css, logo.png)...
"

wget \
-nv \
--directory-prefix="$DEST_DIR" \
--recursive \
--level=1 \
"$URL/skins/monobook/main.css"

wget \
-nv \
--directory-prefix="$DEST_DIR" \
--recursive \
--level=1 \
"$URL/skins/common/images/wiki.png"


# Find and replace absolute wiki css paths in static pages

echo "
Replacing absolute wiki paths...
"

for FILE in "$DEST_DIR_COMPLETE"/*.html ; do
    # Rewrite absolute skin paths so the pages find the locally downloaded CSS and images
    sed 's/\/wiki\/skin/skin/g' "$FILE" > "$FILE.new" && mv "$FILE.new" "$FILE"
done


# Try copying index file

echo "
Trying to copy the main page (*Hauptseite.html) to index.html to have
an easier entry point..."
# 'Hauptseite' is the German main page - adjust the pattern for your wiki's language
cp "$DEST_DIR_COMPLETE"/*Hauptseite.html "$DEST_DIR_COMPLETE/index.html"


# DONE

echo "
FINISHED! Look for the results at $DEST_DIR_COMPLETE
file://$PWD/$DEST_DIR_COMPLETE/
"
Anthony DiPierro
2005-10-27 22:12:49 UTC
What's wrong with using dumpHTML.php?
Post by Sebastian Albrecht
[quoted message and script snipped]
Sebastian Albrecht
2005-10-27 22:24:54 UTC
Hi Anthony,
Post by Anthony DiPierro
What's wrong with using dumpHTML.php?
Sorry, I don't have a dumpHTML.php in my MediaWiki folder. Is it in a
newer version? I use 1.4.7.

Sebastian
Anthony DiPierro
2005-10-27 22:31:06 UTC
Post by Sebastian Albrecht
Hi Anthony,
Post by Anthony DiPierro
What's wrong with using dumpHTML.php?
Sorry, I don't have a dumpHTML.php in my MediaWiki folder. Is it in a
newer version? I use 1.4.7.
Sebastian
I've never used it myself; I just saw it mentioned on
http://static.wikipedia.org/. Tim Starling would be the right person to
ask for more info, but unless I'm misunderstanding the problem, it seems
to do exactly what is being requested.

Anthony
jdd
2005-10-28 11:30:12 UTC
Post by Sebastian Albrecht
Sorry, I don't have a dumpHTML.php in my MediaWiki folder. Is it in a
newer version?
I found it in the maintenance directory (of the 1.5 version).

It must be run locally on the server.

It didn't do anything really nice for me, though.
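If you want to try it anyway, a run usually looks roughly like this
(done on the server, from the maintenance directory; the destination
path is only an example, and the available options differ between
versions, so check the notes at the top of dumpHTML.php):

cd /path/to/mediawiki/maintenance
# dump the articles as static HTML into the given directory
php dumpHTML.php -d /var/www/wiki-static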

jdd
--
to write to me, go to:
http://www.dodin.net
Eric K
2005-10-29 01:54:59 UTC
The whole WWW allows underscores in usernames. Is there any way to get around this restriction in MediaWiki?

thank you

Eric


Brion Vibber
2005-10-29 05:50:34 UTC
Post by Eric K
The whole WWW allows underscores in usernames. Is there any way to get
around this restriction in MediaWiki?
User names on a wiki are a special case of page names. Underscores are
reserved as the equivalent of spaces in page titles for backwards
compatibility, so Foo_Bar and Foo Bar would refer to the same name.

-- brion vibber (brion @ pobox.com)
