There are only two widely used searches for randomly accessible data: linear search and binary search. I developed a new search and tested it against binary search, linear search, and variations of them.

**Contents**

Summary

Introduction

Materials and Methods

Discussion

Conclusion

References

Acknowledgements

Appendix

**Summary**

There are only two widely used searches for randomly accessible data: linear search and binary search. I developed a new search and tested it against binary search, linear search, and variations of them. I hypothesized that my new search, which estimates the location of a value before placing probes to search for it, would be the most effective. I coded all of the searches in Java, tested them for accuracy, and finally tested them for efficiency. I tested the searches on consistent and random arrays ranging from 100 elements to 100,000 elements. The data supported my hypothesis: my search proved much more efficient, in terms of the number of probes used, than the others when tested on both consistent and random arrays of varying sizes. Furthermore, as the number of elements increased, the number of probes used by the search I developed stayed roughly constant, while the probe counts for binary and linear search kept growing (logarithmically and linearly, respectively). These results are significant because searching randomly accessible data could become much faster in the future.

**Materials and Methods**

My experimental design was to take the average number of probes over 1000 runs of each search on each dataset. The six searches tested were linear search, binary search, random-to-binary search (the first probe is placed randomly and all subsequent probes use binary search), random search (all probes are placed randomly), estimate-to-binary search (the first probe is placed by estimating the location, and subsequent probes use binary search), and estimate search (the location is estimated before every probe). Each search was tested on eight data sets: four “consistent” data sets in which values matched element numbers, and four randomly generated data sets. Within each type, there were arrays with one hundred, one thousand, ten thousand, and one hundred thousand elements. The range of numbers for each data set is 1 to the number of elements; for example, for each array with 100 elements, the range is 1 to 100. For each random data set, there were eight values being searched for: the lowest possible value, the highest possible value, values estimated to be 5, 25, 50, 75, and 95 percent of the way into the array, and 0, which is not included in any of the arrays. For the consistent data sets, there were twelve values being searched for: the lowest possible value, the highest possible value, values estimated to be 5, 25, 30, 40, 50, 60, 70, 75, and 95 percent of the way into the array, and 0, which is not included in the array. It is important to check for values that are not present in the array because in real-life applications it is very possible that a desired value does not exist, and when that is the case, the algorithm needs to identify quickly that the given value is not present. See Appendix B for tables of the datasets and search values.

The consistent data sets were tested on more values than the random ones because if they were tested primarily on values that fall at intervals that are multiples of ½, ¼, and ⅛ of the way through the data, binary search’s efficiency would be inflated, since it is most efficient at those intervals. This was not necessary for the random arrays because a value estimated to be 50% of the way into the data will typically not fall exactly there, negating any such advantage. All code was written in Java and compiled using BlueJ version 3.0.8.
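As a sketch of the core idea (my own minimal Java illustration, not the code used in the experiment; the class and method names are hypothetical), an estimate-style search places each probe proportionally to where the target value should fall between the current bounds:

```java
// Minimal sketch of an "estimate"-style search on a sorted int array.
// The next probe is placed proportionally to where the target value
// should fall between the values at the current low and high bounds.
public class EstimateSearch {
    // Returns the index of target, or -1 if absent; counts probes in probes[0].
    public static int search(int[] a, int target, int[] probes) {
        int lo = 0, hi = a.length - 1;
        while (lo <= hi && target >= a[lo] && target <= a[hi]) {
            int mid;
            if (a[hi] == a[lo]) {
                mid = lo; // flat range: avoid division by zero
            } else {
                // Estimate the position from the value's relative size.
                mid = lo + (int) ((long) (target - a[lo]) * (hi - lo) / (a[hi] - a[lo]));
            }
            probes[0]++;
            if (a[mid] == target) return mid;
            if (a[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1; // value not present in the array
    }

    public static void main(String[] args) {
        int[] a = new int[100];
        for (int i = 0; i < 100; i++) a[i] = i + 1; // "consistent" array: value = index + 1
        int[] probes = {0};
        System.out.println(search(a, 50, probes)); // prints 49
        System.out.println(probes[0]);             // prints 1: a single probe suffices
    }
}
```

On a consistent array the estimate is exact, so one probe finds the value; on random arrays the estimate is only approximately right and later probes refine it.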

The data for linear search is shown here on a separate graph because its range is much higher than that of the other searches.

In Average Probes (Random), the line for Binary is almost identical to the line for RanBin, and as such is not visible.

See Appendix C for tables of raw data.

**Discussion**

The results of my experiment show that my search, Estimate, was the most efficient search in terms of the number of probes it used, for both consistent and random data. My hypothesis was supported by my data; the search that estimates a value’s location before probing is the most efficient.

It is important to note that Estimate was the most efficient in terms of the number of probes it required, not necessarily in the amount of time it took to find the value. For example, binary search used three lines of code to place a new probe, and Estimate used nine. This does not translate directly into the amount of time taken per probe, but it must be considered.
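To make the difference concrete, the two probe-placement rules might be sketched as follows (my own illustration; the line counts quoted above refer to the author's original code, which is not reproduced here):

```java
// Sketch: the two probe-placement rules compared side by side.
public class ProbePlacement {
    // Binary search: always split the remaining range in half.
    static int binaryProbe(int lo, int hi) {
        return lo + (hi - lo) / 2;
    }

    // Estimate: place the probe proportionally to the target's value
    // relative to the values at the range boundaries.
    static int estimateProbe(int[] a, int lo, int hi, int target) {
        if (a[hi] == a[lo]) return lo; // flat range: any index works
        long offset = (long) (target - a[lo]) * (hi - lo) / (a[hi] - a[lo]);
        return lo + (int) offset;
    }

    public static void main(String[] args) {
        int[] a = {10, 20, 30, 40, 50, 60, 70, 80, 90, 100};
        System.out.println(binaryProbe(0, 9));          // prints 4
        System.out.println(estimateProbe(a, 0, 9, 90)); // prints 8: lands on the target directly
    }
}
```

The estimate rule does more arithmetic per probe, which is the trade-off the paragraph above describes: fewer probes, but more work per probe.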

If applied to real-world datasets, Estimate could increase search efficiency, but it is unclear by how much. It is very possible that, even though Estimate used only a fraction of the number of probes that binary search did, it would only increase efficiency by a few nanoseconds.

There are many ways that this study of search efficiency could be furthered: efficiency could be tested strictly on time; the searches could be tested on more varied data sets, on much larger datasets, and on real data; and Estimate search could be improved to place probes more accurately and efficiently.

**Conclusion**

The purpose of my study was to determine which search used the fewest probes to find a desired value. My hypothesis was strongly supported by my data; Estimate was the most efficient search.

**References**

Cook, C. E. (2005). *Blue Pelican Java*. College Station: Virtualbookworm.com.

Rowell, E. (n.d.). Know Thy Complexities! Retrieved June 2016, from http://bigocheatsheet.com/

Stack Overflow. (n.d.). Retrieved 2016, from http://stackoverflow.com/

**Acknowledgements**

I am grateful to have had Dr. David Brown as a teacher, as he allowed me to realize my passion for computer science and taught me just about all that I know about it. Mr. Carr taught me the importance of experimental design and how to properly conduct research. Thanks to Mr. Kramer for mentoring me and for continuing to extend my computer science knowledge.

**Appendix**

**Appendix A**

The “Big O notation” of a computational function describes its efficiency in terms of how the number of operations needed to reach the goal (sorting a list, finding a certain value, inserting a node, deleting a node, etc.) grows with the size of the input.
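For illustration (my own example, not from the original), the difference between O(n) and O(log n) can be seen by counting worst-case probes directly:

```java
// Illustration of Big O growth: worst-case probe counts for linear
// versus binary search on a sorted array of n elements.
public class GrowthDemo {
    // Linear search may scan every element before giving up: O(n).
    static int linearWorstCase(int n) {
        return n;
    }

    // Binary search halves the range each probe: about log2(n) probes, O(log n).
    static int binaryWorstCase(int n) {
        int probes = 0;
        while (n > 0) {
            n /= 2;
            probes++;
        }
        return probes;
    }

    public static void main(String[] args) {
        // The same array sizes used in the experiment.
        for (int n : new int[] {100, 1000, 10000, 100000}) {
            System.out.println(n + ": linear=" + linearWorstCase(n)
                               + " binary=" + binaryWorstCase(n));
        }
    }
}
```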

(Rowell, n.d.)

**Appendix B**

Table of all datasets

__Table 1__

Table of All Datasets with Search Values

(Letters correspond to the datasets established in table 1)

__Table 2__

__Table 3__

Each letter paired with a number represents one condition that must be tested by each program.

**Appendix C**

Raw Data

Key for raw data below

__Random__

For all data from random arrays, each data point represents the average of 1000 runs using 1000 different random arrays. For each search value, every search used the same 1000 random arrays, but the 1000 arrays were different between search values. For RanBin and Random, there were 1000 runs on each of the 1000 arrays, and the average of all one million runs is shown.
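A harness in the spirit of this averaging procedure might look like the following sketch (the class and method names, and the fixed seed, are my own assumptions; the original code is not reproduced here):

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of the averaging procedure: generate 1000 sorted random arrays,
// run a search on each, and report the mean probe count.
public class AverageProbes {
    static final Random RNG = new Random(42); // fixed seed for repeatability

    // Sorted array of n random values in the range [1, n].
    static int[] randomSortedArray(int n) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) a[i] = 1 + RNG.nextInt(n);
        Arrays.sort(a);
        return a;
    }

    // Probe count for a plain binary search (whether or not target is found).
    static int binaryProbes(int[] a, int target) {
        int lo = 0, hi = a.length - 1, probes = 0;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            probes++;
            if (a[mid] == target) break;
            if (a[mid] < target) lo = mid + 1;
            else hi = mid - 1;
        }
        return probes;
    }

    public static void main(String[] args) {
        int n = 1000, runs = 1000;
        long total = 0;
        for (int r = 0; r < runs; r++) {
            int[] a = randomSortedArray(n);
            total += binaryProbes(a, n / 2); // search for a mid-range value
        }
        System.out.println("average probes: " + (double) total / runs);
    }
}
```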

__Consistent__

For Linear, Binary, EstBin, and Est, only 1 run was needed, because the array is always the same, and the program would always yield the same result. For RanBin and Random, the average of 1000 runs was taken.
