RatSkep Archive for Safety


Re: RatSkep Archive for Safety

#41  Postby NineBerry » Jun 25, 2020 10:15 am

I have thought about contacting archive.org and offering them the archive of rd.net to keep for posterity. It is a great example of what political discussions were going on in that era among the general population.


Re: RatSkep Archive for Safety

#42  Postby Ironclad » Jun 25, 2020 6:01 pm

It sounds like a very good idea :)
For Van Youngman - see you amongst the stardust, old buddy

"If there was no such thing as science, you'd be right " - Sean Lock

"God ....an inventive destroyer" - Broks

Re: RatSkep Archive for Safety

#43  Postby NineBerry » Jun 28, 2020 10:08 pm

NineBerry wrote: First round of downloading is over. But so is my holiday.

Next steps:

1. Verify that the download is complete. (ETA: 3 days)
2. Download user data and avatars. (ETA: 5 days)
3. Gather metadata from the downloaded data. (ETA: 10 days)
4. Download images used in the forum (only images hosted on RatSkep itself, no external images). (ETA: 14 days)
5. Create an offline version of the files with corrected cross-links between pages, links to the downloaded offline images, and reduced overhead. (ETA: 21 days)
6. Offer the packaged archive for download. (ETA: 23 days)
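
Step 1 (verifying that the download is complete) could be sketched roughly as below. This is only an illustration, not the check that was actually run: it assumes the raw pages are stored as `{thread}-{page}.html` with zero-based page numbers, and the names `DownloadCheck` and `FindMissingPages` are made up. It only finds gaps below the highest page that exists on disk, so a missing last page would still have to be caught by comparing against the "Page X of Y" text in the pages themselves.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class DownloadCheck
{
    // Matches file names such as "123-0.html" (thread 123, page 0).
    private static readonly Regex PageFile = new Regex(@"^(\d+)-(\d+)\.html$");

    // Returns, per thread ID, the page numbers missing between 0 and the
    // highest page found on disk. Threads without gaps are not listed.
    public static SortedDictionary<int, List<int>> FindMissingPages(string folder)
    {
        var pagesByThread = new Dictionary<int, HashSet<int>>();

        foreach (string path in Directory.EnumerateFiles(folder, "*.html"))
        {
            Match match = PageFile.Match(Path.GetFileName(path));
            if (!match.Success)
                continue; // not a downloaded forum page

            int thread = int.Parse(match.Groups[1].Value);
            int page = int.Parse(match.Groups[2].Value);

            if (!pagesByThread.TryGetValue(thread, out HashSet<int> pages))
            {
                pages = new HashSet<int>();
                pagesByThread[thread] = pages;
            }
            pages.Add(page);
        }

        var missing = new SortedDictionary<int, List<int>>();
        foreach (var entry in pagesByThread)
        {
            // Every page from 0 up to the highest one seen should exist.
            List<int> gaps = Enumerable.Range(0, entry.Value.Max() + 1)
                .Where(p => !entry.Value.Contains(p))
                .ToList();
            if (gaps.Count > 0)
                missing[entry.Key] = gaps;
        }
        return missing;
    }
}
```

Calling `DownloadCheck.FindMissingPages(folder)` on the raw download folder returns an empty dictionary when no thread has a gap in its page sequence.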


The current raw data (HTML only) consists of 165,627 files totalling 16.1 GB. I estimate the end result will be approximately 6 GB, maybe 1 GB in compressed form.



Progress report:

Steps 1 and 3 from the list above are done now.

We have 2,595,092 different posts by 4,542 different users in 49,877 different threads.

Next two steps will be 2 and 4.

Re: RatSkep Archive for Safety

#44  Postby kiore » Jun 29, 2020 2:47 am

Thank you for this work and the progress report.
Folding@Home Team member.
What does this stuff mean?
Read here:
general-science/folding-home-team-182116-t616.html

Re: RatSkep Archive for Safety

#45  Postby viocjit » Jul 24, 2020 8:54 pm

NineBerry, which program do you use to copy the content of RatSkep?

Re: RatSkep Archive for Safety

#46  Postby NineBerry » Jul 30, 2020 11:07 am

viocjit wrote: NineBerry, which program do you use to copy the content of RatSkep?


I write the software myself.

I have to confess I am a bit behind my own schedule. There is a lot to do at work, and the good weather calls me to spend time outside afterwards.

Re: RatSkep Archive for Safety

#47  Postby viocjit » Aug 07, 2020 8:18 pm

NineBerry wrote:
viocjit wrote: NineBerry, which program do you use to copy the content of RatSkep?


I write the software myself.

I have to confess I am a bit behind my own schedule. There is a lot to do at work, and the good weather calls me to spend time outside afterwards.


Which programming language is the software written in?


Re: RatSkep Archive for Safety

#48  Postby NineBerry » Aug 07, 2020 10:28 pm

I use C# this time around. The last one, which I created for archiving the Richard Dawkins Forum, was written in Delphi. That worked just as well, but I am more used to C# now.

This is all the code I wrote for downloading the forum and extracting meta information from the downloaded pages. Note that this is not production-level code. That's not what the code I write at work looks like. It takes shortcuts and doesn't have much safety built in, but that's okay, because the code has only one purpose and runs only on my computer.


using System;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Text.RegularExpressions;
using MSHTML;
using FirebirdSql.Data.FirebirdClient;

namespace RatSkep
{
    public partial class Form1 : Form
    {

        private Dictionary<string, int> pForums = CreateForums();

        private static Dictionary<string, int> CreateForums()
        {
            Dictionary<string, int> lList = new Dictionary<string, int>();
            lList.Add("Welcome to RationalSkepticism", 2);
            lList.Add("Announcements", 59);
            lList.Add("Updates", 51);
            lList.Add("Welcome New Members", 28);
            lList.Add("Science & The Humanities", 7);
            lList.Add("General Science & Technology", 15);
            lList.Add("Technical Design and Engineering.", 98);
            lList.Add("Physical Sciences & Mathematics", 61);
            lList.Add("Astronomy & Space Science", 8);
            lList.Add("Chemistry", 10);
            lList.Add("Earth Sciences", 11);
            lList.Add("Mathematics", 12);
            lList.Add("Physics", 14);
            lList.Add("Biological Sciences", 9);
            lList.Add("Evolution & Natural Selection", 62);
            lList.Add("Medicine", 63);
            lList.Add("Psychology & Neuroscience", 64);
            lList.Add("Social Sciences & Humanities", 43);
            lList.Add("Anthropology", 65);
            lList.Add("History", 66);
            lList.Add("Philosophy", 67);
            lList.Add("Sociology", 68);
            lList.Add("Linguistics", 69);
            lList.Add("Economics", 97);
            lList.Add("Belief & Nonbelief", 3);
            lList.Add("Nontheism", 6);
            lList.Add("Atheists in Foxholes", 57);
            lList.Add("Student Life", 70);
            lList.Add("Theism", 16);
            lList.Add("Christianity", 4);
            lList.Add("Islam", 5);
            lList.Add("Other Religions & Belief Systems", 49);
            lList.Add("Debunk It", 34);
            lList.Add("Creationism", 35);
            lList.Add("General Debunking", 39);
            lList.Add("Conspiracy Theories", 38);
            lList.Add("Paranormal & Supernatural", 37);
            lList.Add("Pseudoscience", 36);
            lList.Add("General Topics", 31);
            lList.Add("General Discussion", 33);
            lList.Add("Social & Fun", 74);
            lList.Add("Games", 89);
            lList.Add("Mafia", 90);
            lList.Add("Events", 60);
            lList.Add("Debates", 75);
            lList.Add("Other Languages", 82);
            lList.Add("News, Politics & Current Affairs", 30);
            lList.Add("The Arts & Entertainment", 71);
            lList.Add("Books", 93);
            lList.Add("Film & TV", 94);
            lList.Add("Music", 95);
            lList.Add("Video Games", 96);
            lList.Add("Parenting & Education", 42);
            lList.Add("rationalskepticism.org", 91);
            lList.Add("Feedback, Site Suggestions & Bug Reporting", 46);

            return lList;
        }

        public Form1()
        {
            InitializeComponent();
        }

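        // Wraps a string in single quotes and doubles embedded single quotes
        // so that it can be placed directly into SQL text.
        private string StringToSQLLiteral(string fString)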
        {
            string lResult = fString.Replace("'", "''");
            return "'" + lResult + "'";
        }

        private async void buttonDownload_Click(object sender, EventArgs e)
        {
            Log("Start download");

            int lStart = 1;
            int lEnd = 29712;

            for (int i = lStart; i <= lEnd; i++)
            {
                Log("Processing " + i);

                await DownloadThread(i);
                await Task.Delay(10);
            }

        }

        private async Task DownloadThread(int fThread)
        {
            int lPage = 0;

            bool lFurtherPage = true;
            string lNextMarker = @"class=""right-box right"">Next</a>";

            while (lFurtherPage)
            {
                Log("Processing " + fThread + " - " + lPage);


                string lUrl = GetPageUrl(fThread, lPage);
                string lFileName = GetPageFile(fThread, lPage);

                await NavigateToUrl(lUrl);

                string lText = webBrowser1.DocumentText;
                File.WriteAllText(lFileName, lText);

                lPage++;

                lFurtherPage = lText.Contains(lNextMarker);
            }
        }

        private async Task NavigateToUrl(string lUrl)
        {
            webBrowser1.Stop();
            webBrowser1.Navigate(lUrl);
            while (webBrowser1.ReadyState != WebBrowserReadyState.Complete)
            {
                await Task.Delay(100);
                Application.DoEvents();
            }
        }

        private string GetPageFile(int fThread, int lPage)
        {
            string lPath = Path.GetDirectoryName(Application.ExecutablePath);
            lPath = Path.Combine(lPath, "..\\..\\..\\..\\..\\RawDownload\\");
            lPath += $"{fThread}-{lPage}.html";

            return lPath;
        }

        private string GetPageUrl(int fThread, int fPage)
        {
            // "start" is the offset of the first post on the page; pages advance in steps of 20.
            string lResult = string.Format("http://www.rationalskepticism.org/viewtopic.php?t={0}&start={1}", fThread, fPage * 20);
            return lResult;
        }

        private string GetLogFile()
        {
            string lFileName = Path.ChangeExtension(Application.ExecutablePath, ".log");
            return lFileName;
        }

        private void Log(string v)
        {
            File.AppendAllText(GetLogFile(), v + Environment.NewLine);
            textBox1.AppendText(v + Environment.NewLine);

        }

        private void buttonGotoPage_Click(object sender, EventArgs e)
        {
            webBrowser1.Navigate("http://www.rationalskepticism.org/ucp.php?mode=login");
        }

        private void ExecuteSql(string sql)
        {
            using (FbCommand command = Connection.CreateCommand())
            {
                command.CommandText = sql;
                command.ExecuteNonQuery();
            }

        }

        private FbConnection pConnection = null;
        private FbConnection Connection
        {
            get
            {
                if (pConnection == null)
                {
                    pConnection = CreateConnection();
                    pConnection.Open();
                }

                return pConnection;
            }
        }

        private FbConnection CreateConnection()
        {
            FbConnectionStringBuilder connectionStringBuilder = new FbConnectionStringBuilder();
            connectionStringBuilder.UserID = "sysdba";
            connectionStringBuilder.Password = "masterkey";
            connectionStringBuilder.Database = @"J:\RatSkep\Database\RatSkep.FDB";
            connectionStringBuilder.ServerType = FbServerType.Default;
            connectionStringBuilder.DataSource = "localhost";


            FbConnection result = new FbConnection(connectionStringBuilder.ConnectionString);

            return result;
        }

        private async void buttonRetrieveMeta_Click(object sender, EventArgs e)
        {
            Log("Start Retrieve Meta");

            int lStart = 1;
            int lEnd = 56666;

            for (int i = lStart; i <= lEnd; i++)
            {
                Log("Processing " + i);

                await RetrieveMeta(i);
            }
        }

        private async Task RetrieveMeta(int threadId)
        {
            int lPageCount = await AnalyzeFirstPage(threadId);
            if (lPageCount > 0)
            {
                for(int i=0; i<lPageCount; i++)
                {
                    await AnalyzePage(threadId, i);
                }
            }
        }

        private async Task AnalyzePage(int threadId, int fPage)
        {
            Log("Retrieve Page Meta " + threadId + "-" + fPage);

            string lPath = GetPageFile(threadId, fPage);
            await NavigateToUrl(lPath);

            int lPageSeq = 1;

            IHTMLDocument7 lDoc = webBrowser1.Document.DomDocument as IHTMLDocument7;
            var lElements = lDoc.getElementsByClassName("post");
            foreach(dynamic lElement in lElements)
            {
                // The element id is the numeric post ID preceded by a one-character prefix.
                string lID = lElement.id.Substring(1);

                int lPostId = Convert.ToInt32(lID);

                dynamic lInput = lElement.querySelector("input[name='comment_to_id']");
                int lUserId = Convert.ToInt32(lInput.Value);

                SetDatabasePost(lPostId, threadId, fPage, lPageSeq, lUserId);

                lPageSeq++;
            }


        }

        private void SetDatabasePost(int fPostId, int threadId, int fPage, int fPageSeq, int fUserId)
        {
            string lSql = string.Format(@"UPDATE OR INSERT INTO ""Posts"" (ID , ""ThreadID"", ""ThreadPage"", ""PageSeq"", ""UserID"") VALUES ({0}, {1}, {2}, {3}, {4})",
                fPostId, threadId, fPage, fPageSeq, fUserId);
            ExecuteSql(lSql);
        }

        private async Task<int> AnalyzeFirstPage(int threadId)
        {
            Log("Retrieve First Page Meta " + threadId);

            string lPath = GetPageFile(threadId, 0);
            await NavigateToUrl(lPath);

            var lStatus = CheckFirstPageIsForum(threadId);
            if (lStatus == ThreadStatus.ThreadStatusMeta)
            {
                // Get Num Pages
                int lNumPages = GetNumPages();

                // Check pages available
                string lLastPage = GetPageFile(threadId, lNumPages - 1);
                if(!File.Exists(lLastPage))
                {
                    Log("Page missing " + lLastPage);
                    throw new Exception("Page missing " + lLastPage);
                }

                // Get Title
                string lTitle = GetTitle();

                // Get Forum
                int lForum = GetForum();

                // Set Database entry
                SetDatabaseThread(threadId, lTitle, lStatus, lNumPages, lForum);

                return lNumPages;
            }
            else
            {
                // Set Database Entry
                SetDatabaseThread(threadId, "", lStatus, 0, -1);

                return 0;
            }
        }

        private int GetForum()
        {
            var lLinks = webBrowser1.Document.Links;
            HtmlElement lBackLink = null;
            foreach(HtmlElement lLink in lLinks)
            {
                if(lLink.GetAttribute("accesskey") == "r")
                {
                    lBackLink = lLink;
                    break;
                }
            }

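            if (lBackLink == null)
            {
                throw new Exception("Back link to forum not found");
            }

            // The link text reads "Return to <forum name>"; skip the 10-character "Return to " prefix.
            string lText = lBackLink.InnerText;
            lText = lText.Substring(10);

            return pForums[lText];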
        }

        private string GetTitle()
        {
            var lHeaders = webBrowser1.Document.GetElementsByTagName("h2");
            var lHeader = lHeaders[0];
            return lHeader.InnerText;
        }

        private int GetNumPages()
        {
            IHTMLDocument7 lDoc = webBrowser1.Document.DomDocument as IHTMLDocument7;
            var lElements = lDoc.getElementsByClassName("pagination");
            var lEnumerator = lElements.GetEnumerator();
            lEnumerator.MoveNext();
            IHTMLElement lFirst = lEnumerator.Current as IHTMLElement;

            string lText = lFirst.innerText;

            // The pagination text contains "Page X of Y"; Y is the number of pages.
            string lMatch = Regex.Match(lText, @"Page \d+ of (\d+)").Groups[1].Value;

            return Convert.ToInt32(lMatch);
        }

        private void SetDatabaseThread(int threadId, string fTitle, ThreadStatus fStatus, int fPages, int fForum)
        {


            string lSql = string.Format(@"UPDATE OR INSERT INTO ""Threads"" (ID , ""Title"", ""Status"", ""Pages"", ""Forum"" ) VALUES ({0}, {1}, {2}, {3}, {4})",
                threadId, StringToSQLLiteral(fTitle), (int)fStatus, fPages, fForum
                );
            ExecuteSql(lSql);
        }

        private enum ThreadStatus
        {
            ThreadStatusGone = 1,
            ThreadStatusNoAccess = 2,
            ThreadStatusMeta = 3,
        }


        private ThreadStatus CheckFirstPageIsForum(int threadId)
        {
            var lPhpBB = webBrowser1.Document.GetElementById("phpbb");
            var lMessage = webBrowser1.Document.GetElementById("message");
            var lForumHeader = webBrowser1.Document.GetElementById("topic-search");
            var lError = webBrowser1.Document.GetElementById("http500");

            if (lForumHeader != null && lPhpBB != null)
            {
                // Ok

                var lAction = lForumHeader.GetAttribute("action");
                if(!lAction.EndsWith("?t=" + threadId))
                {
                    Log("Error: Wrong content Thread " + threadId);
                    throw new Exception("Error: Wrong content Thread " + threadId);
                }
               

                return ThreadStatus.ThreadStatusMeta;
            }

            if (lMessage != null && lPhpBB != null && lForumHeader == null)
            {
                // Is Not Accessible
                return ThreadStatus.ThreadStatusNoAccess;
            }

            if (lError != null && lPhpBB == null)
            {
                return ThreadStatus.ThreadStatusGone;
            }

            throw new Exception("Unknown page Content");
        }
    }
}
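
The two `SetDatabase...` methods above build their SQL text with `string.Format` and manual quote doubling. A variant that passes the values as query parameters might look like the sketch below. It is written against the provider-neutral `System.Data.Common` base classes rather than the Firebird classes directly; the `SqlHelper` class, its method names and the `@`-prefixed parameter names are assumptions for illustration and are not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Data.Common;

public static class SqlHelper
{
    // Same statement as in SetDatabaseThread, with placeholders instead of
    // values formatted into the SQL text.
    public const string UpsertThreadSql =
        @"UPDATE OR INSERT INTO ""Threads"" (ID, ""Title"", ""Status"", ""Pages"", ""Forum"") " +
        "VALUES (@id, @title, @status, @pages, @forum)";

    // Collects the values for one thread row, keyed by parameter name.
    public static Dictionary<string, object> ThreadParameters(
        int threadId, string title, int status, int pages, int forum)
    {
        return new Dictionary<string, object>
        {
            ["@id"] = threadId,
            ["@title"] = title, // no quote doubling needed: sent as data
            ["@status"] = status,
            ["@pages"] = pages,
            ["@forum"] = forum,
        };
    }

    // Runs a non-query statement on an already open connection and binds
    // each value as a parameter. Returns the number of affected rows.
    public static int Execute(DbConnection connection, string sql,
        IReadOnlyDictionary<string, object> parameters)
    {
        using (DbCommand command = connection.CreateCommand())
        {
            command.CommandText = sql;
            foreach (KeyValuePair<string, object> pair in parameters)
            {
                DbParameter parameter = command.CreateParameter();
                parameter.ParameterName = pair.Key;
                parameter.Value = pair.Value ?? DBNull.Value;
                command.Parameters.Add(parameter);
            }
            return command.ExecuteNonQuery();
        }
    }
}
```

Inside the form, a call could then read `SqlHelper.Execute(Connection, SqlHelper.UpsertThreadSql, SqlHelper.ThreadParameters(threadId, fTitle, (int)fStatus, fPages, fForum));`, which would make `StringToSQLLiteral` unnecessary for thread titles.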


Re: RatSkep Archive for Safety

#49  Postby viocjit » Aug 14, 2020 10:52 am

NineBerry, thanks for sharing the source code of your program with us.

What is the full list of programming languages you use or have used?

I'm not a programmer, and my programming skills are very poor.
I have tried a dozen languages, but I can't say with certainty what the full list is.
