A file is a collection of bytes stored on a disk or similar device. To use files well, we need an understanding of the devices that contain them, along with their advantages and limitations. This information will begin to explain the traditional mechanisms that have evolved for using files from programming languages generally and Python in particular.
The file as a data structure was devised for storing information on tapes and disks. Together with some other devices that are used rarely (e.g., cram files), these are referred to as secondary storage, where primary storage is the computer’s memory. Memory was (and still is) too expensive to store everything that is needed on a computer. Secondary storage is cheaper than memory and can contain a much larger amount of data. Modern disks can contain terabytes of data, where one terabyte (TB) is 10^12 bytes. It has been estimated that a human being’s functional memory is about 1.25 TB. A terabyte is a lot of storage.
Most secondary storage devices store data magnetically. Since tapes are rarely seen anymore, the example presented here is that of a disk. A disk is a circular platter made of glass or ceramic material and coated with a thin layer of magnetic material, often a compound of iron. That’s why they look brown: iron oxide (or rust) is that color. The disk is mounted on a spindle that is connected to a motor, which spins it at a high rate of speed.
A device called a read/write head sits above the moving disk, but very near to it. This device is a small piece of magnetizable metal wrapped in a fine wire, not unlike the read/write heads in an old video tape recorder (VCR) or cassette machine. It is a property of magnets and coils that a moving magnet creates (induces) an electric current in a nearby coil, and a coil with a current flowing through it can create a magnetic field.
To write data to the moving disk, a current is sent to the read/write head, which creates a small magnetic mark on the disk below the head. Magnets have two orientations; they have a north pole and a south pole. Current flowing one way creates a magnet in the disk that has a north pole appearing before the south pole, or an N-S mark. Current flowing the other direction through the head creates a magnet on the disk that has the south pole appearing before the north pole, or an S-N mark. One orientation, say N-S, will represent a binary digit 1, and the other (S-N) will represent a 0. In this way, binary numbers can be written to the surface of the moving disk.
Reading numbers involves the magnetic regions of the disk passing quickly past the read/write head and inducing small currents in the coil. These are amplified and classified by a simple electronic circuit that detects the current flow one way as N-S and another way as S-N, thus allowing binary numbers to be read from the disk.
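The write and read steps described above can be imitated in a few lines of Python. This is only a sketch of the idea, not of how a real drive encodes data: the names `write_byte` and `read_byte` are invented for illustration, each mark is modeled as the string "N-S" or "S-N", and the choice to write the most significant bit first is an assumption.

```python
# Hypothetical model: N-S stands for a 1 and S-N for a 0, as in the text.
MARK_FOR_BIT = {1: "N-S", 0: "S-N"}
BIT_FOR_MARK = {mark: bit for bit, mark in MARK_FOR_BIT.items()}

def write_byte(value):
    """Return the eight marks for one byte, most significant bit first."""
    bits = [(value >> shift) & 1 for shift in range(7, -1, -1)]
    return [MARK_FOR_BIT[bit] for bit in bits]

def read_byte(marks):
    """Rebuild the byte from eight marks passing under the head in order."""
    value = 0
    for mark in marks:
        value = (value << 1) | BIT_FOR_MARK[mark]
    return value

marks = write_byte(ord("A"))    # ord("A") is 65, or 01000001 in binary
print(marks)
print(chr(read_byte(marks)))    # reading the marks back recovers "A"
```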
There is some very complicated physics involved in a disk drive. The read/write head must be very close to the surface of a rapidly rotating disk, as close as 3 nanometers. To accomplish this, the head is aerodynamically flying above the disk. If it ever touches the disk’s surface, the result is catastrophic. At the speeds involved, a large section of the magnetic material on the disk’s surface would be scraped away, and all data there would be lost. In addition, the read/write head would almost certainly be damaged. This event is called a head crash, and normally results in the entire disk drive being ruined. It’s one reason that frequent backup copies of all data should be made.
The picture that is developing is that of a device that returns data as a stream of bits. To make the best use of the area of the disk, the read/write head can move from the outer edge of the disk to nearly the center. Imagine a set of concentric circles on the disk’s surface: the moving read head can position itself over any of them and read the data that had been written there.
The disk is divided into a set of concentric circles called tracks, each of which corresponds to one position of the read/write head (Figure 5.2a). The head can move across the disk surface, but the positions are quantized: positions 0 to Ntracks can be reached through commands to a controller that change the head position. The outermost track is numbered 0, and the numbers increase as the head moves inward to the center. The disk is also divided into sectors, each of which is a wedge-shaped portion of the disk (Figure 5.2b). These are again numbered 0 to Nsectors, and create an address for a set of bits. Data can be read from sector 3 track 12 by positioning the read head over track 12 and waiting for sector 3 to rotate into position under the head. The data takes as long to read as the sector takes to pass under the read head.
This description answers two important questions. First, data can be accessed by using the <track, sector> address. The data in a single track and sector is a block, and for the sake of convenience all blocks are the same size, traditionally 512 bytes (4096 bytes for Advanced Format, or AF, drives). Second, it explains why accessing data takes so long when reading from a disk. A typical disk rotates at 7200 RPM, or 120 revolutions per second, which is one rotation every 8.3 milliseconds, and the head often has to wait for part of a rotation before the requested sector arrives.
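The rotation figures can be checked with a little arithmetic. The sketch below uses the 7200 RPM value from the text. The `BlockAddress` tuple is a hypothetical way of writing down a <track, sector> address and is not part of any real disk interface.

```python
from collections import namedtuple

# Hypothetical representation of a <track, sector> address.
BlockAddress = namedtuple("BlockAddress", ["track", "sector"])

RPM = 7200                                   # rotation speed quoted in the text

rotations_per_second = RPM / 60
ms_per_rotation = 1000 / rotations_per_second

address = BlockAddress(track=12, sector=3)   # the example used in the text
print(address)
print(f"{rotations_per_second:.0f} rotations per second")
print(f"{ms_per_rotation:.1f} ms per rotation")
```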
1. How Are Files Stored on a Disk?
A file can be thought of as a set of blocks. If blocks are 512 bytes in size and some data to be stored in a file consists of N bytes, then that file will need ⌈N/512⌉ blocks, which is N/512 rounded up to the next integer. The rounding up is needed because it’s not possible to have two files share a single block, so the unused part of a file’s last block is wasted.
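The block count is easy to compute. The following is a minimal sketch, assuming 512-byte blocks as in the text. The function name `blocks_needed` is made up for this example, and integer arithmetic is used to avoid floating-point rounding.

```python
BLOCK_SIZE = 512    # bytes per block, as in the text

def blocks_needed(n_bytes, block_size=BLOCK_SIZE):
    """Return n_bytes / block_size rounded up to a whole number of blocks."""
    return (n_bytes + block_size - 1) // block_size

for size in (1, 512, 513, 1000):
    count = blocks_needed(size)
    unused = count * BLOCK_SIZE - size
    print(f"{size} bytes need {count} block(s), leaving {unused} bytes unused")
```

A file of 513 bytes needs two blocks, even though the second block holds only a single byte of data.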
It gets more complicated, though, because it will not always be possible to have all of the blocks that belong to a file lie next to each other. A file might consist of many blocks, all of which are some distance apart in terms of their track and sector. There is a need for a data structure to connect these blocks in the correct order to make a file. It’s not very hard to do but is another step. This data structure is written to the disk also. The result is that reading a file means finding the location of this data structure on the disk, getting the track and sector values, and then reading the data from those and copying it into memory. The data structure containing the sectors is usually found through a file name that the user has provided. There is a list of file names and the track/sector address of their index sectors in a special file someplace on the drive, or in many places. File systems tend to be organized hierarchically, so that one main name is accessed to find the files within that part of the disk (directory), and within that directory are names of more files and directories. It is a significant part of the function of an operating system like Linux or Windows to provide a convenient way to access files.
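The idea of an index that connects scattered blocks can be imitated with ordinary Python dictionaries. This is a simplified sketch rather than a description of any real file system: the disk is modeled as a dictionary keyed by (track, sector), the directory stores the list of block addresses directly instead of pointing to an index sector on the disk, and the names `write_file`, `read_file`, and `notes.txt` are invented for the example.

```python
BLOCK_SIZE = 512

disk = {}        # simulated disk: (track, sector) -> the 512 bytes in that block
directory = {}   # simulated directory: file name -> (block addresses, file length)

def write_file(name, data, free_addresses):
    """Split data into blocks and store them at whatever addresses are free."""
    index = []
    for start in range(0, len(data), BLOCK_SIZE):
        address = free_addresses.pop(0)
        chunk = data[start:start + BLOCK_SIZE]
        disk[address] = chunk.ljust(BLOCK_SIZE, b"\0")   # pad the final block
        index.append(address)
    directory[name] = (index, len(data))

def read_file(name):
    """Look up the index by name, then read the blocks in the recorded order."""
    index, length = directory[name]
    data = b"".join(disk[address] for address in index)
    return data[:length]                                 # drop the padding

free = [(12, 3), (40, 7), (5, 0)]     # free blocks that are far apart on the disk
message = b"x" * 700                  # 700 bytes, so two blocks are needed
write_file("notes.txt", message, free)
print(directory["notes.txt"])
print(read_file("notes.txt") == message)
```

The two blocks of the file are not adjacent, but the index records their order, so the file can still be read back correctly.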
2. File Access is Slow
How long does it take to access a block of data on the disk? It depends on where the disk head is and where the disk rotation has placed the target block at the time the request is made. There will be only a statistical answer, but for a random block, it could take an average of 10 ms to move the head to the correct track (seek time), and on average another half of a rotation (4.15 ms) for the target sector to arrive under the head. Add to this the time needed to read the block, which is 8.3/Nsectors ms, where Nsectors is the number of sectors per track, or about 0.008 ms for a disk with 1024 sectors. This last term is small enough to be ignored, and the time to access a random block can be estimated as 14.15 milliseconds.
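The estimate can be reproduced with a short calculation. The values below are the ones quoted in the text (a 10 ms average seek, 8.3 ms per rotation, 1024 sectors). The variable names are chosen only for this sketch.

```python
SEEK_MS = 10.0           # average time to move the head to the right track
ROTATION_MS = 8.3        # time for one full rotation at 7200 RPM
SECTORS = 1024           # number of sectors assumed in the text

rotational_latency_ms = ROTATION_MS / 2      # on average, wait half a rotation
transfer_ms = ROTATION_MS / SECTORS          # time for one sector to pass the head
access_ms = SEEK_MS + rotational_latency_ms + transfer_ms

print(f"rotational latency: {rotational_latency_ms:.2f} ms")
print(f"transfer time:      {transfer_ms:.4f} ms")
print(f"total access time:  {access_ms:.2f} ms")
```

The transfer time contributes less than a hundredth of a millisecond, which is why the text drops it from the estimate.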
As a comparison, fast computer memory can access data within 8 nanoseconds. If a person could write the word “Gigabyte” on a whiteboard in 8 nanoseconds, then what could they do in 14 milliseconds? They could copy the entire Bible onto the board over 16 times. Disks are vastly slower than memory, and to use the data, it must be copied into memory. This is a bottleneck in many computer systems.
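To put a number on the gap, the two access times can be divided. This sketch uses the 8 nanosecond and 14.15 millisecond figures from the text, and the variable names are illustrative only.

```python
MEMORY_ACCESS_NS = 8        # memory access time quoted in the text
DISK_ACCESS_MS = 14.15      # estimated time to reach a random disk block

disk_access_ns = DISK_ACCESS_MS * 1_000_000   # 1 ms is 1,000,000 ns
ratio = disk_access_ns / MEMORY_ACCESS_NS

print(f"One disk access takes as long as about {ratio:,.0f} memory accesses")
```

With these figures, a single random disk access costs roughly as much time as 1.77 million memory accesses.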
Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.