Searching in Python

Searching is the act of determining whether some specific data item appears in a list and, if so, at which index. It seems like an odd thing to do; what can be done knowing this information? It is especially useful when multiple lists hold different data concerning the same items. An employee, as one example, might have their data saved as a name list, an employee ID list, phone number, office number, and home address. The same index gives information of the same individual for each list. Thus, search the employee ID list for 18762; if that index is 32, then the employee’s name can be found at name[32].

Of course, Python has built-in operations on a list that will do this:

if 18762 in employeeID: # Is this ID a member of

# the list?

k = employeeID.index(18762) # What is the index of 18762?

A reason to examine searching algorithms is that not all languages possess these specific features and not all programs are written in Python. Another is that someone had to implement the operations for the Python system itself, and they had to know how. Are the built-in operations as fast as ones that a programmer could code for themselves?

1. Timings

Any section of code in Python requires some amount of time to execute. The specific amount depends on many things: the computer being used, the Python compiler, the specific statements, the data, and random events, such as what other programs are executing on that computer at the same time. However, if it is important to know whether a section of code is faster than another, there are timing functions that can provide a pretty good idea. The time module includes a function named clock() that returns (on Windows) the elapsed time expressed in seconds elapsed since the first call to this function. On Linux it behaves differently, and time.time() may be a better choice.

Timing a section of code is done by calling time.clock() before and after the code executes and subtracting the two times. For example, timing a search of a list using the in operator could be done in the following way:

import time

list = [19872,87656,10982,18756,56344,29765,12856,12534, 88768,90012]

t0 = time.clock()

if 90012 in list:

found = True

t1 = time.clock()

print (“Time was “, t1-t0)

This prints the message:

Time was 2.062843880463903e-05

That’s a pretty small time, as is to be expected. When run again, the result was 3.07232e-06; running again gets 2.194514766e-06 and again 7.9002531e-06. These numbers are all small but very different. Since that is true, it is better to time many executions of the code and divide by the number of times it ran:

t0 = time.clock()

for i in range (0,10000):

if 90012 in list:

found = True

t1 = time.clock()

print (“Time was “,(t1-t0)/10000)

This yields more consistent results: 5.5284e-07, 5.5951e-07, and 5.415e-07 in three different trials. Averaging the result of multiple trials gives even better results, because spurious times on any one run will be averaged out.

2. Linear Search

Consider the list that was used in the timing example:

list = [19872,87656,10982,18756,56344,29765,12856,12534, 88768,90012]

Finding whether the target number 90012 appears in this list is a matter of looking at each element to see if it is equal to the target. If so, the answer is “yes,” and the index at which it was found is also known. This can be done using a basic for statement:

index = -1

for i in range(0,len(list)):

if list[i] == target:

index = i

break

# If the value of index is >= 0 then it was found.

This algorithm looks at each element asking “Is this equal to the target?” When/if the target is located, the loop ends and the answer is known. If the target is not a member of the list, then the algorithm has to examine all members of the list to determine that fact. Thus, the worst case is when the element is not in the list, and it requires N comparisons to find that out. If the element is a part of the list then, on the average, it will require N/2 comparisons to find it. It could be the first element, or the last, or any of the others, which averages out to N/2.

If the list is in sorted order, then the loop can be exited as soon as it is known whether the element is in the list. That is, as soon as the target is smaller than the element it is being compared against in the list, it is clear that it can’t be a member of the list, and the loop can be exited. This normally speeds up the execution, but the penalty is that the list has to be sorted, and the time needed to do this (only once, of course) has to be taken into account.

3. Binary Search

If the list has been sorted then there is a faster way to search for an element. The list can be divided into two parts by looking at the value in the middle of the list and comparing it to the target. If the target is smaller than the middle element, then it would have to be in the lower indices (left), otherwise it would have to exist in a higher valued index (to the right). What this means for performance is that the search area is cut in half each time a comparison is done.

This idea seems simple, but is actually difficult to get right in an implementation. At conferences where many PhDs in computer science are presenting papers, it has been found that fewer than 10% of the participants can code a binary search that works the first time. The terminal conditions are tricky: in particular, how can it be determined that the target is not in the list? The details are crucial. At the beginning there is a list, and its length is known. The index of the middle element is known too, and the list is sorted. Find the index of the middle element:

istart = 0

iend = len(list)

m = (iend+istart)//2

If the target is in the list, is it at a smaller index than m (i.e., is list[m]>target):

if list[m]>target:

If so, don’t bother looking at any index bigger than m. In other words, the largest index to look at would be m-1:

iend = m-1

If the target is in the list, is it at a larger index (i.e., is list[m]<target)? If so, don’t look at any locations with an index less than m; in other words,

elif list[m]<target:

istart = m+1

If target = list[m], then it has been found and the algorithm terminates.

else:

return m

This code has to be repeated until the target has been found, or it has been determined that it is not in the list. The loop condition is critical. The loop continues so long as istart <= iend, so that if the final step finds the target in the list, then it will return the index. If the loop exits without finding the element, then the index value is -1. The final code, as a function, is

def search (list, target):

istart = 0

iend = len(list)

while istart<=iend:

m = (iend+istart)//2

if list[m]>target:

iend = m-1

elif list[m]<target:

istart = m+1

else:

return m

return None

The speed of the binary search depends on the fact that it is searching a randomly accessible data set like a Python list or a Java array, and not a file. It takes on the order of log(n) probes into the list to find what it is looking for or to determine that it is not there.

Timing the binary search gave an execution time of 3.305e-06 seconds, still slower than the built-in operation.

Source: Parker James R. (2021), Python: An Introduction to Programming, Mercury Learning and Information; Second edition.

1. Timings

2. Linear Search

3. Binary Search

Leave a Reply Cancel reply

Login