  1. Hi All, Sorry about the delay. I was busy with some more urgent stuff and my home server crashed (two HDDs dead at the same time - blessed is the RAID 6), which means I am waiting for two replacement hard drives to have a DB with suitable performance. They should arrive tomorrow, so I'll be able to finish this weekend for sure :-) Have a nice day you all, Murad
  2. Hi All, here is a small update: I still have some trouble matching the archive posts (which don't have any IDs) to their respective threads because of special characters and various encodings. However, I am confident that I'll finish soon, so hold tight :-)
  3. Hi All, Here is a small update: I've decided to unplug, take a few days off and go on holiday, so I am gonna have to postpone the release until late next week. Sorry for any inconvenience, Murad
  4. A little update: - The work is progressing well; I think that I'll have a definitive version late next week. Murad
  5. Hail All, Due to some real interest I have decided to proceed with the tuning. I'll keep you posted how it goes. Murad
  6. All right, after careful evaluation I may have some time to spare over this summer to work on the data, but I ask the following from the community. Everybody who would like to see the data in its full available state, please post a small message on this thread; based on the number of responses, I will make the final decision whether to leave this in its current state or take the time to improve it. The partial data has been available for nearly 24 hours and not many people have accessed it so far. It would be suboptimal to spend my time on something nobody will use. Thanks, Murad
  7. Hail, First of all, I would like to salute the person who, in such a short time, wrote a script able to extract so much of the data so precisely. I've checked the script and here are my preliminary findings. There is definitely room for improvement. It would seem (take this claim with caution) that the script doesn't support all the skin HTML patterns in the raw HTML and sometimes hasn't parsed them correctly; it might even have parsed only some of the available page patterns. One example might be this thread from the old KH forum that I've found in the original raw d
  8. Right, I am checking the script now... Murad
  9. Would it be possible for me to grab that script? I'd like to have a look at what he has done and at the parsed data.
  10. On the other hand, when I search for some posts that I remember, I find them all, which is encouraging. Below is a screenshot of the 2005 Sosarian Morning Poste.
  11. Hmm, it doesn't add up... If we assume that there were about 10 posts per page, then the data would cover about 210k posts. However, we must assume there was at least 1 page per thread, so for it to add up only 5k threads must have had more than one page. Also, the table 'user', which contains the member names that I parsed from the posts, contains 3749 rows... Hmmm, I guess we'll just have to wait and see.
  12. Hmm, that sounds logical. Does anyone know how much data (pages, threads, posts) was on the old f4g forums?
  13. I hope he means that he still has 28k pages to do. If he means that only 28k pages are parsable, then the data is in much worse shape than I hoped.
  14. I started to scrape the Google cache about 36 hours after the f4g forums were declared lost... At that time the f4g forums weren't being refreshed in the Google cache, since the site was flagged as 'down'. By the time I finished scraping the data, only the first posts were starting to appear on f4g and being pulled into the Google cache. The duplicate content should be minimal (less than 0.2%). Murad
  15. Sounds more than promising
