Developing a New Search for Randomly Accessible Data

There are only two widely-used searches for randomly accessible data: linear search and binary search. I developed a new search and tested it against binary search, linear search, and variations of them.

Contents

Summary
Introduction
Materials and Methods
Discussion
Conclusion
References
Acknowledgements
Appendix

Summary

There are only two widely-used searches for randomly accessible data: linear search and binary search. I developed a new search and tested it against binary search, linear search, and variations of them. I hypothesized that my new search, which estimates the location of a value before placing probes to search for it, would be the most effective. I coded all of the searches in Java, tested them for accuracy, and finally tested them for efficiency. I tested the searches on consistent and random arrays ranging from 100 elements to 100,000 elements. The data supported my hypothesis: my search proved much more efficient, in terms of the number of probes used, than the others when tested on both consistent and random arrays of varying sizes. Furthermore, as the number of elements increased, the number of probes used by the search I developed stayed roughly constant, while the probe counts for binary and linear search continued to grow with the size of the data. The results are significant because they suggest that searching randomly accessible data could be made much faster.

Introduction

Simple searching algorithms can be used on randomly accessible data, such as data stored in an array, to quickly locate a desired value (Cook, 2005, p. 340). The two methods typically used for searching are binary search and linear, or sequential, search (Cook, 2005, p. 241). Linear search operates with an efficiency of O(n), and binary search with a better efficiency of O(log(n)) (see Appendix A for more information on “Big O notation”). Binary search works by repeatedly cutting a sorted dataset in half until the desired value is found. For example, when asked to find a number between 1 and 100, a guesser would guess fifty, be told whether the number being searched for is higher or lower, and then guess again; this continues until the desired value is found. Linear search works by checking every value from the beginning of the list to the end until the value is found: when asked to find a certain number between 1 and 100, the guesser would guess 1, then 2, then 3, then 4, and so on. Linear search can be very slow on large datasets, which is why binary search, paired with a sorting algorithm, would typically be used. Binary search is very effective at quickly locating values, but can still take a long time on very large data. Although binary search is sufficiently fast for retrieving data, I wanted to make a search that is more efficient. In school, I had learned about many different algorithms for sorting, but only two for searching: linear search and binary search. This led me to wonder whether a more efficient search could be created. The reasoning behind my hypothesis was that humans would not use linear search or binary search when searching an ordered list.
For example, if someone was looking for the word “yak” in a dictionary, they certainly would not read each word one at a time starting at the beginning of the dictionary (how linear search works), and they would not flip to a page in the middle of the dictionary either (how binary search works). Instead, they would flip to a page perhaps nine tenths of the way through the dictionary, because they know that “yak” will be toward the end. I used this idea of estimating before probing to create a new way of searching ordered, randomly accessible memory. For my experiment (testing my searches against the commonly used binary and linear searches), the searches used and the datasets they were used on were the independent variables, and the number of probes used to find a given value was the dependent variable. The purpose of this experiment was to determine which search is most effective. My hypothesis was that my search, “Estimate,” would be the most effective.
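The paper does not show the source of the Estimate search, but the estimate-before-every-probe idea it describes can be sketched as interpolation-style probing on a sorted array. The class and method names below are my own illustration, not the original code; the sketch also counts probes, since that is the paper's dependent variable.

```java
// Minimal sketch of the "estimate before probing" idea described above
// (essentially interpolation-style probing). Illustrative only; names and
// structure are assumptions, not the author's original implementation.
public class EstimateSearch {
    // Returns the index of target in sorted array a, or -1 if absent.
    // probes[0] is incremented once per element examined.
    static int search(int[] a, int target, int[] probes) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi && target >= a[lo] && target <= a[hi]) {
            // Estimate the position from the value's fraction of the current
            // range, instead of always probing the midpoint as binary search does.
            int pos = (a[hi] == a[lo]) ? lo
                    : lo + (int) ((long) (target - a[lo]) * (hi - lo) / (a[hi] - a[lo]));
            probes[0]++;
            if (a[pos] == target) return pos;
            if (a[pos] < target) lo = pos + 1;
            else hi = pos - 1;
        }
        return -1; // value not present in the array
    }

    public static void main(String[] args) {
        // "Consistent" array in the paper's sense: value matches element number.
        int[] data = new int[100_000];
        for (int i = 0; i < data.length; i++) data[i] = i + 1;
        int[] probes = {0};
        System.out.println(search(data, 75_000, probes)); // index 74999
        System.out.println(probes[0]); // 1 probe on a perfectly uniform array
    }
}
```

On a consistent array the first estimate lands exactly on the target, which matches the paper's observation that probe counts for Estimate stayed roughly constant as the arrays grew.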

Materials and Methods

My experimental design was to take the average number of probes over 1000 runs of each search on each dataset. The six searches tested were linear search, binary search, random-to-binary search (the first probe is placed randomly and all subsequent probes use binary search), random search (all probes are placed randomly), estimate-to-binary search (the first probe is placed by estimating the location, and subsequent probes use binary search), and estimate search (the location is estimated before every probe). Each search was tested on eight datasets: four “consistent” datasets in which values matched element numbers, and four randomly generated datasets. Within each type, there were arrays with one hundred, one thousand, ten thousand, and one hundred thousand elements, and the range of values in each dataset is 1 to the number of elements (for example, 1 to 100 for each array with 100 elements). For each random dataset, eight values were searched for: the smallest possible value, the largest possible value, values estimated to be 5, 25, 50, 75, and 95 percent of the way into the array, and 0, which is not included in any of the arrays. For the consistent datasets, twelve values were searched for: the lowest possible value, the highest possible value, values estimated to be 5, 25, 30, 40, 50, 60, 70, 75, and 95 percent of the way into the array, and 0, which is not included in the array. It is important to check for values that are not present in the array because in real-life applications it is very possible that a desired value does not exist, and when that is the case, the algorithm needs to quickly identify that the given value is not present. See Appendix B for tables of the datasets and search values.
The consistent datasets were tested on more values than the random datasets because if they were tested primarily on values that fall at multiples of ½, ¼, and ⅛ of the way through the data, binary search’s efficiency would be inflated; it is most efficient at those intervals. This was not necessary for the random arrays because a value estimated to be 50% of the way into the data will typically not fall exactly there, negating any possible advantage. All code was written in Java and compiled using BlueJ version 3.0.8.
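The generation code for the two dataset types is not reproduced in the paper; the construction below is one plausible reading of the description above (consistent arrays whose values match element numbers, and random arrays of values drawn from 1 to n, kept sorted so ordered searches can run on them). The class name and seeding are my own assumptions.

```java
// Sketch of how the two dataset types described above could be built.
// This is an assumed construction consistent with the paper's description,
// not the author's original generation code.
import java.util.Arrays;
import java.util.Random;

public class Datasets {
    // "Consistent" array: the value at element i is i + 1, so values match
    // element numbers over the range 1..n.
    static int[] consistent(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = i + 1;
        return a;
    }

    // Random array: n values drawn uniformly from 1..n, then sorted so that
    // the searches (which require ordered data) can be run on it.
    static int[] random(int n, long seed) {
        Random r = new Random(seed);
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = 1 + r.nextInt(n);
        Arrays.sort(a);
        return a;
    }
}
```

Sorting the random arrays is the step that makes a value "estimated to be 50% of the way into the data" land somewhere other than the exact midpoint, which is the effect the paragraph above relies on.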

The data for linear search is shown here on a separate graph because its range is much higher than that of the other searches.

In Average Probes (Random), the line for Binary is almost identical to the line for RanBin, and as such is not visible.

See Appendix C for tables of raw data.

Discussion

The results of my experiment show that my search, Estimate, was the most efficient search in terms of the number of probes it used, for both consistent and random data. My hypothesis was supported by my data; the search that estimates a value’s location before probing is the most efficient.

It is important to note that Estimate was the most efficient in terms of the number of probes it required, not necessarily in the amount of time it took to find the value. For example, binary search used three lines of code to place a new probe, while Estimate used nine. Lines of code do not translate directly into time taken per probe, but the extra work per probe must be considered.

If applied to real-world datasets, Estimate could increase search efficiency, but it is unclear by how much. It is very possible that, even though Estimate used only a fraction of the probes binary search did, it would reduce real search times by only a few nanoseconds.

There are many ways this study of search efficiency could be extended: efficiency could be measured strictly in time; the searches could be tested on more varied, much larger, and real-world datasets; and Estimate could be improved to place probes more accurately and efficiently.

Conclusion

The purpose of my study was to determine which search used the fewest probes to find a desired value. My hypothesis was strongly supported by my data; Estimate was the most efficient search.

References

Cook, C. E. (2005). Blue Pelican Java. College Station: Virtualbookworm.com.

Rowell, E. (n.d.). Know Thy Complexities! Retrieved June 2016, from http://bigocheatsheet.com/

Stack Overflow. (n.d.). Retrieved 2016, from http://stackoverflow.com/

Acknowledgements

I am grateful to have had Dr. David Brown as a teacher, as he allowed me to realize my passion for computer science and taught me just about all that I know about it. Mr. Carr taught me the importance of experimental design and how to properly conduct research. Thanks to Mr. Kramer for mentoring me and for continuing to extend my computer science knowledge.

Appendix

Appendix A

The “Big O notation” of a computational function describes its efficiency in terms of how the number of operations needed to reach the goal (sorting a list, finding a certain value, inserting a node, deleting a node, etc.) grows with the size of the input.

(Rowell, n.d.)
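As a rough worked illustration of what these growth rates mean in probes (my own example, not data from the experiment): worst-case linear search examines every element (O(n)), while binary search halves the remaining range on each probe (O(log n)).

```java
// Illustrative worst-case probe counts for the array sizes used in the
// experiment. Not from the paper's data; a demonstration of O(n) vs O(log n).
public class GrowthDemo {
    // Number of times a range of n elements can be halved until empty,
    // i.e. binary search's worst-case probe count.
    static int binaryProbes(int n) {
        int probes = 0;
        for (int span = n; span > 0; span /= 2) probes++;
        return probes;
    }

    public static void main(String[] args) {
        for (int n : new int[] {100, 1_000, 10_000, 100_000}) {
            System.out.println(n + " elements: linear worst case " + n
                    + " probes, binary worst case " + binaryProbes(n) + " probes");
        }
    }
}
```

Going from 100 to 100,000 elements multiplies linear search's worst case by 1000, while binary search's worst case grows only from 7 to 17 probes.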

Appendix B

Table of all datasets

Table 1.

Table of All Datasets with Search Values 

(Letters correspond to the datasets established in table 1)

Table 2

Table 3

Each letter paired with a number represents one condition that must be tested by each program.

Appendix C

Raw Data

Key for raw data below

Random

For all data from random arrays, each data point represents the average of 1000 runs using 1000 different random arrays. For each search value, every search used the same 1000 random arrays, but the 1000 arrays were different between search values. For RanBin and Random, there were 1000 runs on each of the 1000 arrays, and the average of all one million runs is shown.

Consistent

For Linear, Binary, EstBin, and Est, only 1 run was needed, because the array is always the same, and the program would always yield the same result. For RanBin and Random, the average of 1000 runs was taken.
