7.6 Exercises

1.  (Easy to Medium: System Commands)

Use the system command or the backquote method to compress a file. Use a compression command commonly available on a Unix platform, such as compress, zip, or gzip. Extend the program to compress every file in a certain directory. Leave sub-directories as they are.

Finally, extend the program further to compress sub-directories recursively. Use the system command cd to go into a directory. Use system commands wherever you can. Avoid using Perl functions such as chdir.
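One point is worth checking before starting on the recursive part: a cd issued through system or backquotes runs in a child shell, so it does not change the working directory of the Perl program itself. The following minimal sketch illustrates this. It assumes a Unix-like system with a Bourne-style shell, and the directory / is used purely for illustration.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);

# Working directory of this Perl process before running any command.
my $before = getcwd();

# system runs the command in a child shell; the cd affects only that shell.
# The command's output goes straight to our standard output.
my $status = system("cd / && pwd");

# Backquotes also use a child shell, but capture the output as a string.
my $captured = `cd / && pwd`;
chomp $captured;

# The parent process has not moved.
my $after = getcwd();

print "exit status:  ", $status >> 8, "\n";
print "captured:     $captured\n";
print "parent moved? ", ($before eq $after ? "no" : "yes"), "\n";
```

Any command that must run inside a particular directory therefore has to be part of the same command string as the cd that takes it there.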

2.  (Easy: Pipe)

Study the documentation for the Unix find command. The find command generates the names of files recursively below a certain directory. Now, use this command in a pipe to search every file in a directory recursively. The program’s first command-line argument is the directory to be searched recursively. The arguments that follow are one or more words to be searched for in the files. Report the full path of every file in which all the words occur.
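As a starting point, the sketch below shows one way to read the output of find through a pipe. The helper name files_under is hypothetical, and the sketch assumes that find is on the PATH and that no file name contains a newline. Searching the files for the given words is left to the exercise.

```perl
use strict;
use warnings;

# Return the regular files that find reports below $dir.
# (files_under is a hypothetical helper name used for illustration.)
sub files_under {
    my ($dir) = @_;
    # The list form of open runs find directly, without a shell,
    # so unusual characters in $dir are not interpreted.
    open(my $find, '-|', 'find', $dir, '-type', 'f')
        or die "Cannot run find: $!";
    my @files;
    while (my $path = <$find>) {
        chomp $path;
        push @files, $path;
    }
    close($find);
    return @files;
}

# Use the first command-line argument, or the current directory.
my $dir = @ARGV ? $ARGV[0] : '.';
print "$_\n" for files_under($dir);
```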

3.  (Easy: E-mail, Pipe)

You are given one or more files, each of which contains a number of valid e-mail addresses, one per line. Use a pipe to create a simple mailing list of your own. Your program takes the name of a message file and the names of one or more mailing list files as command-line arguments. It sends a mail message to every individual in the mailing list files, one by one. The message file contains a well-demarcated subject on top, say indicated by the keyword Subject:. This keyword is followed by the subject of the mail message. The keyword and the text of the subject are on one line. There are one or more blank lines following the subject line, and then the actual text of the mail message.
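The message format described above can be split into its two parts with a single pattern match. The sketch below is one possible approach, not a complete solution. The helper names parse_message and pipe_message are hypothetical, and the cat command merely stands in for whatever mail program is available locally, so nothing is actually mailed.

```perl
use strict;
use warnings;

$| = 1;    # flush our own output before the child process writes

# Split a message into subject and body. Assumes the first line is
# "Subject: ...", followed by one or more blank lines, then the body.
sub parse_message {
    my ($text) = @_;
    my ($subject, $body) =
        $text =~ /\ASubject:[ \t]*([^\n]*)\n(?:[ \t]*\n)+(.*)\z/s
        or die "Message does not start with a Subject: line\n";
    return ($subject, $body);
}

# Write text to the standard input of a command through a pipe.
# @command is an assumption: substitute the local mailer and its options.
sub pipe_message {
    my ($text, @command) = @_;
    open(my $pipe, '|-', @command) or die "Cannot start @command: $!";
    print {$pipe} $text;
    close($pipe) or die "@command failed\n";
}

my $message = "Subject: Meeting moved\n\n\nThe meeting is now at noon.\n";
my ($subject, $body) = parse_message($message);
print "subject is: $subject\n";
pipe_message($body, 'cat');    # cat stands in for a mail program
```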

4.  (Medium: E-mail, Pipe, Fork)

Write a program that modifies the mailing list program in the previous problem. Its purpose is exactly the same as the program in the previous problem. However, it forks a new process for each mailing list file. In other words, if the program is given several mailing list files containing e-mail addresses, it forks a separate process to send the message to the e-mail addresses in one file. The main process does not send any e-mail, only the child processes do. As the program’s processes send e-mail, each process logs who the e-mail has been sent to in a log file. Each process logs to the same log file, not to separate files. The log file’s name is given as a command-line argument as well.
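When several processes write to one log file, their lines can interleave unless the writes are coordinated. The sketch below shows one common way to do this, using flock with the file opened in append mode. The helper names log_line and run_children are hypothetical, the sending of mail is replaced by a comment, and advisory locking is assumed to work on the file system in use.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Append one line to a log file shared by several processes.
sub log_line {
    my ($logfile, $line) = @_;
    open(my $log, '>>', $logfile) or die "Cannot open $logfile: $!";
    flock($log, LOCK_EX) or die "Cannot lock $logfile: $!";
    print {$log} "$line\n";
    # close flushes the line and then releases the lock.
    close($log) or die "Cannot close $logfile: $!";
}

# Fork one child per list file; only the children write to the log.
sub run_children {
    my ($logfile, @lists) = @_;
    my @pids;
    for my $list (@lists) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: a real program would send the mail here.
            log_line($logfile, "process $$ handled $list");
            exit 0;
        }
        push @pids, $pid;
    }
    # Parent: wait for every child to finish.
    waitpid($_, 0) for @pids;
}

# Usage: perl thisfile.pl LOGFILE LISTFILE...
my ($logfile, @lists) = @ARGV;
run_children($logfile, @lists) if defined $logfile;
```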

5.  (Medium to Hard: Web Site Mirroring)

Write a program that mirrors a Web site. It takes a URL as a command-line argument. It also takes the name of a destination mirror directory on the machine where it is run. It fetches every file under the top-level URL and creates appropriate sub-directories. Every file is copied to the right location in the mirror. Use sockets, and do not use any more powerful packages that you may be able to find. You may face a large number of problems in writing such a program. Document these problems, and discuss and implement the solutions that you find.

6.  (Medium: Web Site Traversal, Stale Links)

As a Web site gets older, it usually starts to contain dead or stale links if it is not kept up-to-date. Write a program that takes a Web site’s URL, recursively looks at the Web pages at the Web site, and reports any dead or stale links along with the URL of the page where each one occurs. Define a stale link as one that cannot be fetched within a certain timeout period. Write the URLs of all stale links to a file whose name is also given as a command-line argument. Discuss any problems you face and how you solve them. Use TCP-based sockets.

7.  (Medium to Hard: Traversing Web Site, System Commands)

Given a URL as an argument, recursively obtain the files in the Web site indicated by the URL. The goal here is to find out how many graphic files with .jpg, .jpeg, or .gif extensions there are in the site, and the individual and cumulative sizes of these files. Report these data by printing a table to a file whose name is given as a command-line argument.

Now, extend the program in the following way. Give the program an additional command-line argument. It is a number, say 100,000. Ask the user whether he or she wants to view the “large” graphic files once all the graphic URLs have been fetched and written to a file. If the user says so, let the program use a system command to display the files bigger than the given size, one by one, in a graphic viewer, say, Electric Eyes (ee) on a Unix machine, or a Web browser such as Netscape. Show the files in decreasing order of size. Discuss any problems you face, and how you solve them. Use TCP-based socket connections for communication.

8.  (Medium to Hard: Text Processing, Traversing Web Site, Research)

Write a search engine for a “small” Web site. The program takes a URL as a command-line argument. It fetches every text file in the site. For each file fetched, it removes all HTML tags. It also removes commonly occurring words such as “is”, “the”, “it”, etc. Make a good list of such words. Such words are called stop words. We will take a simplistic approach to creating our search engine. For each file, compute the frequencies of the words that occur in it. Consider the words with the top five frequency counts as keywords. For each keyword, store in a file (one would normally use a database, but we have not learned how to use one) where it occurs. The organization of this file should be such that it is easy to search.
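The keyword-selection step described above can be sketched as follows. This is only an illustration under simplifying assumptions: the tag removal is a crude pattern substitution that will not handle every HTML page, the stop list is deliberately tiny, words are taken to be runs of ASCII letters, and the helper name top_words is hypothetical.

```perl
use strict;
use warnings;

# A deliberately tiny stop list; a real one would be much longer.
my %stop = map { $_ => 1 } qw(is the it a an of and to in);

# Return up to $n of the most frequent non-stop words in an HTML string.
sub top_words {
    my ($html, $n) = @_;
    (my $text = $html) =~ s/<[^>]*>/ /g;    # crude tag removal
    my %count;
    for my $word (lc($text) =~ /[a-z]+/g) {
        $count{$word}++ unless $stop{$word};
    }
    # Highest count first; ties are broken alphabetically.
    my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    $#sorted = $n - 1 if @sorted > $n;
    return @sorted;
}

my $page = '<html><body><p>The cat sat. The cat ran; the dog sat.</p></body></html>';
print join(' ', top_words($page, 2)), "\n";    # prints "cat sat"
```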

The search program is a separate program. Normally, it is a CGI program that is invoked by an HTML form, but CGI is not discussed till later in this book. The search program, for now, is invoked on the command-line. It takes one or more keywords, and returns a list containing the URLs where all the keywords occur.

This is a very simple start to writing a search engine. Discuss any problems you face and how you solve them. Discuss what bigger problems arise that you cannot solve, or cannot solve well. Research these problems and their possible solutions, and write a research report.

9.  (Medium to Hard: Text Processing, Traversing Web Site, Research, Can be a long-term project)

Write a program that takes a URL as a command-line argument. Use TCP-based sockets for communication. It finds the most commonly used words, bigrams and trigrams in each file of the site. A bigram is a pair of consecutive words. A trigram is a sequence of three consecutive words. The program removes all HTML tags found in the Web pages. The program has a stop list of words. The program removes all such words from consideration. The program also removes bigrams and trigrams that begin or end with such words.
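To make the definitions concrete, the sketch below builds bigrams and trigrams from a short list of words and drops those that begin or end with a stop word. The helper name ngrams, the sample words, and the three-word stop list are all made up for illustration.

```perl
use strict;
use warnings;

# Return all n-grams (as space-joined strings) from a list of words,
# dropping any that begin or end with a stop word.
sub ngrams {
    my ($n, $stop, @words) = @_;
    my @grams;
    for my $i (0 .. $#words - $n + 1) {
        my @gram = @words[$i .. $i + $n - 1];
        next if $stop->{ $gram[0] } || $stop->{ $gram[-1] };
        push @grams, "@gram";
    }
    return @grams;
}

my %stop  = map { $_ => 1 } qw(the of a);
my @words = qw(the fall of rome began slowly);

# prints "bigrams:  rome began, began slowly"
print "bigrams:  ", join(', ', ngrams(2, \%stop, @words)), "\n";
# prints "trigrams: fall of rome, rome began slowly"
print "trigrams: ", join(', ', ngrams(3, \%stop, @words)), "\n";
```

Note that the trigram "fall of rome" survives even though it contains a stop word, because the stop word is neither its first nor its last word.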

Extend the search program in the previous problem so that it can search for bigrams and trigrams as well.

Examine the bigrams and trigrams you collect. There are a huge number of them. Many of these are sequences for which no one would ever search. How can we reduce the number of bigrams and trigrams to those for which a user may conceivably search? Write your thoughts out in a report.

10.  (Medium to Hard: Text Processing, Traversing a Web Site, Research, Can be a long-term project)

Write a program that takes a URL as a command-line argument. It removes all HTML tags from the Web pages that it recursively fetches. The goal of this program is to parse the pages and obtain, as best it can, all proper names used in these pages. Use heuristics, such as the fact that proper names in English start with an uppercase letter. Use any other heuristics you can come up with. Use bigrams and trigrams as well. List all the “proper names” you gather. Examine them to see what percentage of the collected words and phrases are actually proper names. Discuss how this percentage can be improved.

11.  (Medium: FTP, File Transfer, Client-Server)

Write a program that is similar to the interactive client-server programs discussed in this chapter. Here, we assume you preferably have accounts on two machines, although one will also work. The program is like a simple FTP (File Transfer Protocol) client-server pair.

A client can ask the server to send a file by giving its name. When the client gets the file, it stores the file on the client’s machine with the same file name.

A client can also tell the server that it is sending the server a file with a specific name. If the server agrees, the client sends the file. Once the server gets the whole file, it stores it on the server with the same file name.

First, think of the details of how the communication should proceed between the client and the server. Once you have it clear in your mind, write it down, and then write the code. Discuss any problems you face, and how you solve them. Use TCP-based sockets for communication.

Note that in a real FTP program, a remote user has to log in to the server’s computer. We do not bother about this issue here. Do not fork your client or server.

12.  (Medium: FTP, Client-Server, Forking)

Modify the program you wrote for the previous problem so that the client and server are both forked.

13.  (Hard: FTP, File Transfer, Client-Server, Forking)

In a real FTP program, the client and the server actually use two ports for communication. One pair of ports is used for communicating the commands for sending a file and receiving a file, and for reporting status, such as whether a command has been carried out successfully. Another pair of ports is used to actually transfer the contents of the files or the data between the client and the server. Extend your program to follow this model of communication. Continue using TCP-based sockets for communication.

Draw a nice diagram showing the communication steps first. Write out the algorithms on paper before you start implementing them. Once again, we are not worried about the user logging in and being authenticated.

14.  (Hard: FTP, Recursive File Transfer, Client-Server, Forking)

Extend the program in the previous problem by adding two more commands.

•    A command that a client can use to send all files in a directory, recursively, to the server.

•    A command that a client can use to request all files in a directory recursively from a server.

15.  (Hard: FTP, File Transfer, Client-Server, Can be a long-term project)

The File Transfer Protocol (FTP) is a commonly used protocol on the Internet, used for transferring files between two machines. It works using the client-server model of computing, above TCP/IP.

Familiarize yourself with the FTP client. Learn the commands. FTP works across different hosts and file structures. However, let us focus on how it works on the Linux machines in our labs.

Summarize the FTP client commands on a single page.

Study the RFC from 1985 that specifies FTP (RFC 959). Summarize the RFC on a page or two.

Write code for an FTP client and an FTP server. Implement the following FTP client commands: open, user login and password, cd, lcd, ls, mkdir, get, put and close. You do not have to implement it faithfully following the RFC, but make the program functional.

Use TCP-based sockets.

16.  (Medium to Hard: E-Commerce, SET, Research, Can be a long-term project)

The Secure Electronic Transaction (SET) is an encryption and security specification designed to protect credit card transactions on the Internet.

SET is quite complicated. It provides for secure communication channels among all parties involved in a transaction. The specification of SET came out in 1997 and is 971 pages long.

The participants in SET are:

(a) the card holder,

(b) the merchant,

(c) the card issuer,

(d) the acquirer,

(e) the payment gateway, and

(f)  the certification authority.

SET is discussed in some detail in Chapter 14 of Stallings [Sta99]. Study the description of SET in this text, or any other book, or on the Web.

Write code that allows you to set up the entities involved in a transaction and allow communication among them, two at a time. Note that not all parties have to communicate with every other party.

Describe the code you write in one or two typed pages.

17.  (Hard: Instant Messaging, Research, Can be a long-term project)

Write a program that is an instant messenger (IM). Instant messenger programs are not new. A program called talk, or variants of it, has been available on Unix machines for decades. These days, instant messengers have graphical user interfaces (GUIs), and can handle graphic images, audio, and video files. However, to start, let us focus on text exchange only. Write a program that allows two individuals to exchange text messages like one of the currently popular instant messenger programs. Do not worry about the beauty of the display to begin with. Just make it functional.

Once you have a working IM program, write down a list of all the ways you can improve it. Also, write how you can implement the improvements. Of course, one way to improve it is to use a graphical user interface. It is not difficult to build graphical user interfaces in Perl, although we have not discussed how to do so in this book.

Extend the program so that you can exchange images and audio files. Invoke a display program for fetched images. Similarly, invoke an audio program for fetched audio files.

18.  (Hard: Peer-to-peer File Exchange, Research, Can be a long-term project)

Write a program that allows peer-to-peer exchange of files between two individual computers over the Internet. This program can be an extension of the file transfer programs that some of the earlier problems have asked you to write. List the problems you need to address in writing such a program. Peer-to-peer file exchange programs can be used to transfer text, music or video files among users, dispersed around the world.

Extend the program to invoke programs on your computer to display or play the exchanged files.

19.  (Medium to Hard: Network Services, Research)

Perl has a large number of modules that deal with network services such as TELNET, FTP, Netnews, Mail, etc. Find out what these modules are. Learn how to telnet to a remote machine and run commands on it from your Perl program. Learn how to FTP files and directories automatically from a Perl program. Such a program can be used to back up your files to another machine from time to time. Learn how to read and post messages to Internet newsgroups from a Perl program. Finally, learn how to read POP or IMAP mail from your Perl program.

20.  (Medium to Hard: Remote Procedure Calls, Research)

Find out how remote procedure calls (RPCs) can be made from Perl. Where can you use RPCs?

21.  (Medium to Hard: Domain Name Server, Research)

Perl has modules to interface with a DNS (Domain Name Server). Find out what these modules are and learn how to use them. Use them to rewrite your solutions to the network programming problems in this chapter.