One of my assignments in my CS 240 course was to design and implement a "Web Cloner", similar to the Unix utility "wget" and the Internet Explorer "Archive" feature.
This project required more planning than any previous project I had worked on and took 1500-2000 lines of code to complete. For this reason, we were required to create a design document to...document our design before beginning coding. I found this design document very helpful during implementation, since much of the big-picture work had been settled beforehand.
The final result was a command-line utility that takes a web address and a local directory to save the web page to. The page at that address is copied to disk, and every link it contains is added to a list for processing. Each link is then handled as follows:
- Links to files in the same domain are archived as well (and HTML pages are subsequently parsed for their own links)
- External links (to files in other domains) are replaced with links to a stub HTML page showing the original external address
- References to selected internal image file types (JPG, GIF, PNG) are downloaded, and the references are rewritten to point at the local copies
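The rules above boil down to classifying each link relative to the page it was found on. A minimal sketch of that classification in Python (the function name and return labels are my own illustration, not the original project's code):

```python
from urllib.parse import urljoin, urlparse

# Internal image types that get downloaded directly
IMAGE_EXTENSIONS = (".jpg", ".gif", ".png")

def classify_link(base_url: str, href: str) -> str:
    """Classify a link found on a page:
    'internal' pages are queued for cloning and parsed in turn,
    'image' files are downloaded and their references rewritten,
    'external' links are replaced with a stub page."""
    absolute = urljoin(base_url, href)          # resolve relative links
    base_domain = urlparse(base_url).netloc
    link = urlparse(absolute)
    if link.netloc != base_domain:
        return "external"
    if link.path.lower().endswith(IMAGE_EXTENSIONS):
        return "image"
    return "internal"
```

A full cloner would repeat this over a work list: pop a URL, save the page, classify each link it contains, and push newly discovered internal pages back onto the list until it is empty.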
This project was fun because the result was something easy to appreciate and evaluate. It was interesting to consider how the more "professional" tools might handle this task and how many cases they have to deal with. It made me appreciate those tools more and made me want to be a part of some larger project like that.