#! /usr/bin/perl


=head1 NAME

php2media_wiki.pl -- Try to convert phpwiki pages into a mysql mediawiki
database


=head1 AUTHOR

theocrite, random guy
mail : theocrite _ theocrite _ org
xmpp : theocrite _ jabber _ fr
theocrite@freenode


=head1 DESCRIPTION

This script tries to convert the phpwiki files of a snapshot to mediawiki
syntax and puts the result in a mysql database.

There are 5 steps:

	- prepare : Just do some basic stuff (mkdir, unzip, rm) and make sure
	everything is ready before we can start

	- encode : makes sure encoding/format is acceptable
	(no more Windows/Dos encoding/format)

	- convert : do the actual conversion (translate from phpwiki to
	mediawiki syntax)

	- insert : insert everything in a mysql db

	- congrats : Everything worked fine, you can hug yourself
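
The convert step is easiest to picture on a couple of lines. Here is a minimal
sketch of a few of its substitutions (line breaks, headings, labelled internal
links). convert_line_sketch is a hypothetical helper, not a function of this
script, and it deliberately ignores lists, external links, tables and verbatim
blocks.

```perl

    use strict;
    use warnings;

    # Hypothetical helper (not part of the script): converts one phpwiki
    # line using a simplified subset of the substitutions done in replace().
    sub convert_line_sketch {
        my ($line) = @_;
        $line =~ s!%%%!<br/>!g;            # forced line break
        $line =~ s/^!!!(.*)$/=$1=/;        # biggest phpwiki heading
        $line =~ s/^!!(.*)$/==$1==/;       # medium heading
        $line =~ s/^!(.*)$/===$1===/;      # smallest heading
        # [label|target] becomes [[target|label]]
        $line =~ s/\[\s*([^|\]]*?)\s*\|\s*([^\]]*?)\s*\]/[[$2|$1]]/g;
        return $line;
    }

    print convert_line_sketch('!!!Meetings'), "\n";
    print convert_line_sketch('See [the agenda|MeetingAgenda]%%%'), "\n";

```

Note that phpwiki puts the label first and the target second, while mediawiki
wants the target first, hence the swap of the two captures.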


=head1 SYNOPSIS

perl php2media_wiki.pl
./php2media_wiki.pl

There are no inline options. Everything is set using vars defined inside the
script.

=head1 USAGE

Make sure you have a snapshot in the parent of the working dir (default /tmp),
that you have the right permissions and that the vars are set according to what
you want to do. Then just run the script as shown above.


=head1 REQUIREMENTS

* libiconv

* convmv

* tofrodos

* unzip

* perl (obviously) and the required modules

* A working mediawiki install and access (SELECT, INSERT, UPDATE) to the
database. DELETE can be added if you plan to let the script empty the table
before refilling it.

	mysql> GRANT SELECT, INSERT, UPDATE[, DELETE] ON mediawiki.* TO
	mediawiki@server.example.com IDENTIFIED BY 'mediawiki';

* A fully functional brain could be useful too
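
The script shells out to dos2unix and iconv for the encode step. If one of
them is missing, an equivalent for the file contents can probably be sketched
in pure Perl with the core Encode module. to_unix_utf8_sketch is a
hypothetical helper, not part of the script. It assumes the input really is
CP1252 (as the script does) and it does not rename files the way convmv does.

```perl

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Hypothetical helper: pure-Perl stand-in for the dos2unix + iconv pair.
    # Takes raw CP1252 bytes, returns UTF-8 bytes with UNIX line endings.
    sub to_unix_utf8_sketch {
        my ($bytes) = @_;
        my $text = decode('cp1252', $bytes);   # CP1252 bytes -> characters
        $text =~ s/\r\n/\n/g;                  # DOS line endings -> UNIX
        return encode('UTF-8', $text);         # characters -> UTF-8 bytes
    }

```

The function works on a whole file read into a scalar, so it would be called
once per page, on the bytes read from $origin, before writing to $encoded.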


=head1 CONFIGURATION

See the vars at the beginning of the script (right after the doc). The names
are pretty explicit:

* $tmp_dir : The directory holding the snapshot (default /tmp).

* $snapshot : The name of the snapshot, with a leading slash (the default
value should be ok in most cases).

* $snap_from : The full path of the snapshot, found by globbing
$tmp_dir$snapshot*.

* $working_dir : Where should I put all the files while working?

* $origin : The directory where the snapshot will be unzipped.

* $encoded : This is where files encoded in utf8 and converted to UNIX format
will be stored.

* $translated : Dir where the files converted from phpwiki to mediawiki
syntax will be stored.

* $dbh is the database handle. Be sure that the user/password and host are set
according to what you need.
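
All the directories hang off a single base directory. As an illustration
only, the default layout can be sketched as a small function. layout_sketch is
a hypothetical name; the script itself uses plain variables, and the values
below are just its defaults.

```perl

    use strict;
    use warnings;

    # Hypothetical helper: derive the working directories from one base
    # directory, mirroring the default values set in the script.
    sub layout_sketch {
        my ($tmp_dir) = @_;
        my $working_dir = "$tmp_dir/php2media_wiki";
        return (
            working_dir => $working_dir,
            origin      => "$working_dir/origin",      # unzipped snapshot
            encoded     => "$working_dir/encoded",     # UTF-8, UNIX format
            translated  => "$working_dir/transcoded",  # mediawiki syntax
        );
    }

    my %dirs = layout_sketch('/tmp');
    print "$_ => $dirs{$_}\n" for sort keys %dirs;

```

Changing the base directory is then enough to move the whole working tree.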


=head1 SEE ALSO

Here are some links about scripts/methods to do the same thing we are trying
to do here. They didn't work when I tried, that's why I bothered writing this
from scratch (well, to be honest not really from scratch, they helped me a lot).

http://wikiworld.com/wiki/index.php/PhpWiki2MediaWiki

http://www.webforce.co.nz/phpwiki2mediawiki.php.txt

http://www.mediawiki.org/wiki/Manual:PhpWiki_conversion


=head1 KNOWN BUGS


=head2 * History is lost

History is lost, only the latest snapshot is converted. Old revisions and
associated authors aren't. This is due to the phpwiki and mediawiki respective
formats. To keep history, we would need a tool way more clever and way more
complicated.


=head2 * Plugins are lost

Plugin pages are lost too (or misinterpreted by mediawiki)

eg : Calendars, Blogs, Tex, php color...

Not much we can do here. We would need a tool able to make a full conversion of
all the plugins. But this would require a lot of energy for something not really
valuable.


=head2 * Page hierarchy is lost (deleted)

Pages used like directories aren't handled properly (yet).

For example:

	http://www.example.com/meetings/april2008  will fail
	http://www.example.com/meetings            will succeed

Reason : When trying to unzip the snapshot, pages like this will have a main
page (meetings in the example) unzipped. It will then be impossible to unzip a
'meetings' directory and fill it with files (like april2008).

There is an easy workaround, if you really need to keep the pages (hierarchy is
still lost). Just apply the following patch:

==========

@@ -161,6 +161,7 @@
         $name    = Archive::Zip::_asLocalName($name);
     }
     if ( $dirName && !-d $dirName ) {
+       rmtree $dirName, {verbose => 1};
         mkpath($dirName);
         return _ioError("can't create dir $dirName") if ( !-d $dirName );
     }

==========

Apply it to the file Archive.pm (on Debian, package libarchive-zip-perl :
/usr/share/perl5/Archive/Zip/Archive.pm)
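
For pages that do make it to the insert step, the file name is turned back
into a page title by dropping the directory part and decoding %2F into a
slash. Here is a minimal sketch of that mapping. title_from_file_sketch is a
hypothetical helper, and the capitalisation of the first letter is an
assumption based on mediawiki's default title rules.

```perl

    use strict;
    use warnings;

    # Hypothetical helper: derive a mediawiki page title from the path of a
    # translated file.
    sub title_from_file_sketch {
        my ($path) = @_;
        (my $title = $path) =~ s!^.*/!!;   # drop the directory part
        $title =~ s!%2F!/!g;               # %2F in a file name encodes '/'
        return ucfirst $title;             # titles start with an uppercase
    }

    print title_from_file_sketch(
        '/tmp/php2media_wiki/transcoded/meetings%2Fapril2008'), "\n";

```

The slash only survives in the title. Whether mediawiki treats it as a real
subpage depends on the namespace settings of the target wiki.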


=head2 * Conversion of tables may fail

Table conversion isn't perfect and may fail when using colspan and/or rowspan.
I didn't see it myself, so I didn't try anything to make it work (and it may 
well work if you're very very lucky).
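
To see what the table step does in the simple case, here is a minimal sketch
that reuses the three table substitutions of the script on an OldStyleTable
block. table_line_sketch and $in_table are hypothetical names. The point is
only that the > and < markers after a pipe are dropped, and that nothing in
these rules knows about colspan or rowspan.

```perl

    use strict;
    use warnings;

    my $in_table = 0;    # true while we are inside an OldStyleTable block

    # Hypothetical helper: applies the table rules of replace() to one line.
    sub table_line_sketch {
        my ($line) = @_;
        $in_table = 1 if $line =~ s/<\?plugin OldStyleTable.*/{|/;
        if ($in_table) {
            $line =~ s/^\|[><]?/|-\n|/;                       # new row
            $line =~ s/\|[><]?([^{|\n][^\n|]*)/|$1\n/g;       # one cell per line
            $in_table = 0 if $line =~ s/\?>/|}/;              # end of table
        }
        return $line;
    }

    print table_line_sketch($_), "\n"
        for ('<?plugin OldStyleTable', '|>left|right', '?>');

```

Every cell ends up on its own line, which mediawiki accepts, but any spanning
information carried by the original markup is gone at that point.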


=head1 CHANGELOG

* Version 0.122 -- Converts internal links (buggy)

* Version 0.102 -- Some bugfixes

* Version 0.90 -- Renames Home page to Accueil (Main page). Not sure it's
useful for all wikis or just the one I'm working on.

* Version 0.89 -- Handles a new way to have *bold* text, handles weird tables
(with |> and/or |<).

* Version 0.62 -- Now handles category (badly)

* Version 0.52 -- Verbatim ok too (phew)

* Version 0.51 -- Tables ok now (this one was tricky)

* Version 0.48 -- Fewer bugs (OMG there are bugs everywhere !)

* Version 0.37 -- Can insert (properly) in a mysql db using dbi.

* Version 0.32 -- Status:

	url : ok
	verbatim : almost ok (in progress)
	links : ok
	tables : broken (regression since perl migration)

* Version 0.27 -- Handles verbatim pretty fine (except when there are an end
and a start tag on the same line).

* Version 0.13 -- From now on, everything will be done in a single perl
script. This will make things easier, I've lost too much time until now.

* Version 0.12 -- Fixes bugs, bugs and other bugs

* Version 0.3  -- 1st working version

=cut

use strict;
use Archive::Extract;
use File::Path;
use File::Copy;
use File::Find;
use DBI;


############
## Some vars
############
my $tmp_dir='/tmp';
my $snapshot='/PhpWikiLatestSnapshot';
my ($snap_from)=<$tmp_dir$snapshot*>;
my $working_dir=$tmp_dir.'/php2media_wiki';
my $origin=$working_dir.'/origin';
my $encoded=$working_dir.'/encoded';
my $translated=$working_dir.'/transcoded';
my $dbh = DBI->connect("DBI:mysql:database=mediawiki;host=server.example.com",
		"mediawiki", "migration",
		{'RaiseError' => 1});


# Pages linked to phpwiki. We don't need it for mediawiki.
my @toremove=qw/2004-04-07 2005-10-24 2005-10-25 2005-10-26 2005-10-27 2005-10-28 2005-10-29 2008-01-05
	Accueil AddingPages AdministrationDePhpWiki AjouterDesPages AliasAccueil AllPages AllUsers AnciennesRèglesDeFormatage ARCHIVE
	BacÀSable BackLinks
	CalendarPlugin CarteInterWiki CategorieCategorie CatégorieCatégorie CatégorieGroupes CategorieLogicielLibre CatégoriePagesAccueil CategoriePersonne CategoryCategory CategoryHomePages ChangementsLiés ChercherUnePage Chown ClassezLa CommentairesRécents CommentUtiliserUnWiki CreatePage CreateToc CategorieUpgradeOpenKnowledge
	DebugInfo DéfinirAcl DéposerUnFichier DernièresModifs DernièresModifsComplètes DerniersVisiteurs DétailsTechniques Deux DocumentationDePhpWiki Droits
	ÉditerLeContenu EditerLesMetaDonnées ÉditionsRécentes EditText ExternalSearchPlugin
	FindPage FullRecentChanges FullTextSearch FuzzyPages
	GestionDesPlugins
	Historique HistoriqueDeLaPage
	IcônesDeLien Info InfosAuthentification InfosDeDéboguage InfosSurLaPage InterWiki InterWikiMap
	LesPlusVisitées LienGoogle LikePages LinkIcons ListeDePages
	MagicPhpWikiURLs ModifsRécentesPhpWiki MoreAboutMechanics MostPopular
	NewMarkupTestPage NotesDeVersion
	OldMarkupTestPage OldStyleTablePlugin OldTextFormattingRules OrphanedPages
	PageAccueil PageAléatoire PageDeTests PageHistory PageInfo PagesFloues PagesOrphelines Pages Recherchées PagesSemblables PageTestAnciennesMarques PageTestNouvellesMarques PgsrcTranslation PhpHighlightPlugin PhpWeatherPlugin PhpWiki PhpWikiAdministration PhpWikiDocumentation PierrickMeignen PluginAlbumPhotos PluginBeauTableau PluginBonjourLeMonde PluginCalendrier PluginColorationPhp PluginCommenter PluginCréerUnePage PluginCréerUneTdm PluginÉditerMetaData PluginHistoriqueAuteur PluginHtmlPur PluginInclureUnCadre PluginInclureUnePage PluginInfosSystème PluginInsérer PluginListeDesSousPages PluginListeDuCalendrier PluginMétéoPhp PluginRechercheExterne PluginRedirection PluginRessourcesRss PluginTableauAncienStyle PluginTeX2png PluginWiki PluginWikiBlog PréférencesUtilisateurs
	Remplacer Renommer RandomPage RecentChanges RecentEdits RechercheEnTexteIntégral RechercheInterWiki RechercheParTitre RécupérationDeLaPage RèglesDeFormatageDesTextes ReleaseNotes RétroLiens REVOIR RolandTrique
	QuiEstEnLigne
	SandBox SommaireDuProjet SondagePhpWiki SteveWainstead StyleCorrect Supprimer
	TestDeCache TestGroupeDePages TextFormattingRules TitleSearch TousLesUtilisateurs ToutesLesPages TraduireUnTexte TranscludePlugin Trois
	Un URLMagiquesPhpWiki UserPreferences
	VersionsRécentes VisiteursRécents VU
	WabiSabi WikiPlugin WikiWikiWeb/;
my $table;


sub insert
{
  my $path=shift;
  open my $fh, '<', $path or die "Can't open $path: $!";
  my $file=join '', <$fh>;
  close $fh;
  my $title=$path;
  $title =~ s!^.*/!!;                              # strip the directory part
  $title =~ s!%2F!/!g;                             # %2F encodes '/' in names
  $title=ucfirst $title;                  # the bare 'ucfirst;' was a no-op

  # Placeholders let DBI quote the title and the text (quotes, backslashes).
  $dbh->do("INSERT INTO page
		(page_id, page_namespace, page_title, page_counter, page_restrictions, page_is_redirect, page_is_new, page_random, page_touched, page_latest, page_len)
	VALUES	(NULL, 0, ?, 0, '', 0, 1, RAND(), NOW()+0, 0, LENGTH(?))",
	undef, $title, $file);

  $dbh->do("INSERT INTO text
		(old_id, old_text, old_flags)
	VALUES  (NULL, ?, 'utf-8')",
	undef, $file);

  # LAST_INSERT_ID() is the id of the text row inserted just above.
  $dbh->do("INSERT INTO revision
	(rev_id, rev_page, rev_text_id, rev_comment, rev_minor_edit, rev_user, rev_user_text, rev_timestamp)
	SELECT NULL, page_id, LAST_INSERT_ID(),'PhpWikiMigration', 0, 1 ,'Admin', NOW()+0 FROM page WHERE page_title = ?",
	undef, $title);

  # LAST_INSERT_ID() is now the id of the revision row.
  $dbh->do("UPDATE page,revision
	SET page.page_latest = LAST_INSERT_ID()
	WHERE page.page_id = revision.rev_page AND revision.rev_id = LAST_INSERT_ID()");
    sleep 2;
}


sub replace
{
  $_=shift;
                                                            # Bullets and Numbers
  s/^\s*\*/'*' x length $&/e;
  s/^\*{5,}/****/;
  s/^\s*#/'#' x length $&/e;
  s/^#{5,}/####/;
  s/^\s*//;
  s!^:! !g;                                                          # Code - OK

                                                                # typeset markup
  s!%%%!<br/>!g;                                                # New lines - OK
  s/(\s)_+([^_]*)_+(\s)/$1''$2''$3/g;                             # italic -- OK
  s/''([^'])''/'''$1'''/g;                                          # bold -- OK
  s/([^*])\*([^*]+)\*/$1'''$2'''/g;                                 # bold -- OK

                                                           # header markup -- OK
  s/^(\**)!!!(.*)$/$1=$2=/g;
  s/^(\**)!!(.*)$/$1==$2==/g;
  s/^(\**)!(.*)$/$1===$2===/g;

                                                             # link markup -- OK
  s/\[\s*(https?:\/\/[^]|]*?)\s*\]/$1/g;
  s/\[\s*([^|]*?)\s*\|\s*([^]]*?)\s*\]/[[$2|$1]]/g;
  s/\[(https?:\/\/[^|]*)\|([^]]*)\]/$1 $2/g;
  s/\[([^| ]*)\]/[[$1]]/g;
                                                                # Implicit links
  $_=join ' ', map {(m/^(([A-Z]\w+){2,})$/ && -f "$encoded/$_")?"[[$_]]":$_} split / /, $_;
  s/\[{2,}([^]]+)\]{2,}/[[$1]]/;

                                                               # redirects -- OK
  s/<\?plugin RedirectTo page=(.*)\?>/#REDIRECT [[$1]]/;

                                                            # table markup -- OK
  $table=1 if (s/<\?plugin OldStyleTable.*/{|/);
  if ($table)
    {
      s/^\|[><]?/|-\n|/;
      s/\|[><]?([^{|\n][^\n|]*)/|$1\n/g;
      $table=0 if (s/\?>/|}/);
    }


  return "$_\n";
}


sub do_replace
{
  my $return='';
  $return.=replace $_ foreach (split /\n/, shift);
  return $return;
}

sub convert
{
  foreach my $file (<$encoded/*>)
    {
      my $return='';
      open FILE, '<', $file or die "Can't open $file: $!";
                                   # Content only begins after the 1st blank line
      while (<FILE>) {last if (/^\s*$/);}
      my $data=join '', <FILE>;
      close FILE;
      while ($data =~ m!\A(.*)<verbatim>(.*)</verbatim>(.*)\Z!s)
        {
          $return=$2.do_replace ($3).$return;
          $data=$1;
        }
      $return=do_replace ($data).$return;
      $return =~ s/\nCategorie(\w+)\s*\Z/\n[[Catégorie:$1]]/;
      $file =~ s!^.*/!!;
      open NEWFILE, '>', "$translated/$file" or die "Can't write $translated/$file: $!";
      print NEWFILE $return;
      close NEWFILE;
    }
}


sub prepare
{
  mkdir $working_dir;
  mkdir $encoded;
  mkdir $origin;
  mkdir $translated;
                     # unzip the archive, removes it, clean up and start working
  my $ae = Archive::Extract->new( archive => $snap_from );
  $ae->extract ( to => $origin ) or die $ae->error;
    # I don't know what those url* are. A bug in the phpwiki snapshot function ?
  rmtree <$origin/url*>, {verbose => 1};
  rmtree <$origin/PluginWikiBlog*>; #You probably should use this.
  finddepth {wanted => sub {-f && move $_, $origin;-d && rmdir}, no_chdir=>1}, <$origin*/>;
            # The finddepth above moves every file to the root directory and
            # deletes the subdirectories. The find below replaces whitespace in
            # filenames; otherwise the script will fail on those.
  find {wanted => sub {my $i=$_;s/\s/_/g && move $i, $_;}, no_chdir=>1}, $origin;
}

sub encode
{
  chdir $origin; 
  foreach (<$origin/*>)
    {
      s/.*\///;
      s/\|/\\|/g;
      `dos2unix -af $origin/$_`;                                # Dos format ftl
      `iconv  -f CP1252 -t UTF-8 $origin/$_ > $encoded/$_`;
      `convmv -f CP1252 -t UTF-8 --notest     $encoded/$_`;
    }
  unlink "$encoded/$_" foreach (@toremove);
  move "$encoded/HomePage", "$encoded/Accueil";
}


sub congrats
{
  print <<'EOF'
                                 _         _       _   _                   _ 
  ___ ___  _ __   __ _ _ __ __ _| |_ _   _| | __ _| |_(_) ___  _ __  ___  | |
 / __/ _ \| '_ \ / _` | '__/ _` | __| | | | |/ _` | __| |/ _ \| '_ \/ __| | |
| (_| (_) | | | | (_| | | | (_| | |_| |_| | | (_| | |_| | (_) | | | \__ \ |_|
 \___\___/|_| |_|\__, |_|  \__,_|\__|\__,_|_|\__,_|\__|_|\___/|_| |_|___/ (_)
                 |___/                                                       
EOF
}


#prepare;                                                         # Preprocessing
#encode;                                                               # Encoding
convert;                                                           # Translation
insert $_ foreach (<$translated/*>);                      # Insertion in a mysql DB
congrats;
#$dbh->disconnect();

