Developing a New Search for Randomly Accessible Data

There are only two widely used searches for randomly accessible data: linear search and binary search. I developed a new search and tested it against binary search, linear search, and variations of both.

Abstract

There are only two widely used searches for randomly accessible data: linear search and binary search. I developed a new search and tested it against binary search, linear search, and variations of both. I hypothesized that my new search, which estimates the location of a value before placing probes to search for it, would be the most effective. I coded all of the searches in Java, tested them for accuracy, and then tested them for efficiency. The searches were tested on consistent and random arrays ranging from 100 elements to 100,000 elements. The data supported my hypothesis: my search required far fewer probes than the others on both consistent and random arrays of every size tested. Furthermore, as the number of elements increased, the number of probes used by the search I developed stayed roughly constant, while the probe counts for binary and linear search kept growing. The results are significant because they suggest that searching randomly accessible data could be made much faster.

Introduction

Simple searching algorithms can be used on randomly accessible data, such as data stored in an array, to quickly locate a desired value (Cook, 2005, p. 340). The two methods typically used are binary search and linear, or sequential, search (Cook, 2005, p. 241). Linear search operates with an efficiency of O(n), and binary search with a better efficiency of O(log n) (see Appendix A for more information on “Big O notation”). Binary search works by repeatedly cutting the sorted dataset in half until the desired value is found. For example, when asked to find a number between 1 and 100, a guesser would guess 50, be told whether the number being searched for is higher or lower, and guess again, continuing until the value is found. Linear search works by checking every value from the beginning of the list to the end until the value is found: asked to find a number between 1 and 100, the guesser would guess 1, then 2, then 3, and so on.

Linear search can be very slow on large datasets, which is why binary search, paired with a sorting algorithm to keep the data ordered, is typically used instead. Binary search locates values quickly, but it can still require many probes on very large datasets. Although binary search is generally fast enough for retrieving data, I wanted to create a search that is more efficient. In school I had learned about many different sorting algorithms, but only two searching algorithms: linear search and binary search. This led me to wonder whether a more efficient search could be created.

The reasoning behind my hypothesis was that humans would not use linear or binary search when looking through an ordered list. For example, someone looking for the word “yak” in a dictionary certainly would not read each word one at a time starting at the beginning of the dictionary (how linear search works), nor would they flip to a page in the middle of the dictionary (as binary search does). Instead, they would flip to a page perhaps nine tenths of the way through, because they know that “yak” will be towards the end. I used this idea of estimating before probing to create a new way of searching ordered, randomly accessible memory.

For my experiment (testing my searches against the commonly used binary and linear searches), the searches used and the datasets they were used on were the independent variables, and the number of probes used to find a given value was the dependent variable. The purpose of this experiment was to determine which search is most effective. My hypothesis was that my search, “Estimate,” would be the most effective.
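For reference, the following is a minimal sketch of binary search on a sorted array of ints, written in the same spirit as the textbook description above; it is not the exact code used in my experiment. It returns the index of the value, or -1 if the value is not present.

```java
// Minimal binary search sketch (textbook version, not the experimental code).
// Returns the index of target in the sorted array, or -1 if target is not present.
public class BinarySearchDemo {
    public static int binarySearch(int[] sorted, int target) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2; // probe the middle of the remaining range
            if (sorted[mid] == target) {
                return mid;
            } else if (sorted[mid] < target) {
                low = mid + 1;  // target must be in the upper half
            } else {
                high = mid - 1; // target must be in the lower half
            }
        }
        return -1; // target is not in the array
    }

    public static void main(String[] args) {
        int[] data = {1, 3, 7, 12, 19, 24, 31, 42};
        System.out.println(binarySearch(data, 19)); // prints 4
        System.out.println(binarySearch(data, 5));  // prints -1
    }
}
```

Each pass through the loop places one probe and discards half of the remaining range, which is where the O(log n) behavior comes from.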

Methods and Materials

My experimental design was to take the average number of probes over 1,000 runs of each search on each dataset. The six searches tested were linear search (Linear), binary search (Binary), random-to-binary search (RanBin: the first probe is placed randomly and all subsequent probes use binary search), random search (Random: all probes are placed randomly), estimate-to-binary search (EstBin: the first probe is placed by estimating the location and all subsequent probes use binary search), and estimate search (Est: the location is estimated before every probe).

Each search was tested on eight datasets: four “consistent” datasets, in which values match their element numbers, and four randomly generated datasets. Within each type there were arrays of one hundred, one thousand, ten thousand, and one hundred thousand elements. The range of values in each dataset is 1 to the number of elements; for example, each 100-element array holds values from 1 to 100. For each random dataset, eight values were searched for: the lowest possible value, the highest possible value, values estimated to be 5, 25, 50, 75, and 95 percent of the way into the array, and 0, which is not included in any of the arrays. For each consistent dataset, twelve values were searched for: the lowest possible value, the highest possible value, values estimated to be 5, 25, 30, 40, 50, 60, 70, 75, and 95 percent of the way into the array, and 0, which is not included in the array. It is important to check for values that are not present, because in real applications it is very possible that a desired value does not exist, and in that case the algorithm needs to determine quickly that the value is missing. See Appendix B for tables of the datasets and search values.

The consistent datasets were tested on more values than the random ones because, had they been tested primarily on values falling at multiples of ½, ¼, and ⅛ of the way through the data, binary search’s efficiency would have been inflated; it is most efficient at exactly those intervals. This was not necessary for the random arrays, because a value estimated to be 50% of the way into random data will typically not fall exactly there, negating any such advantage. All code was written in Java and compiled using BlueJ version 3.0.8.
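My full source code is not included in this report, so the following is only a sketch of the “estimate before every probe” idea: an interpolation-style probe that uses the lowest and highest remaining values to guess where the target should sit. The class and method names are illustrative, and the Estimate search actually tested may place its probes somewhat differently.

```java
// A sketch of the "estimate before every probe" idea (an interpolation-style search).
// This is a reconstruction of the approach described above, not the experimental code.
public class EstimateSearchSketch {
    // Returns the index of target in the sorted array, or -1 if it is not present.
    public static int estimateSearch(int[] sorted, int target) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high && target >= sorted[low] && target <= sorted[high]) {
            // Estimate where target should fall between sorted[low] and sorted[high].
            long range = (long) sorted[high] - sorted[low];
            int probe = (range == 0)
                    ? low
                    : low + (int) (((long) (high - low) * (target - sorted[low])) / range);
            if (sorted[probe] == target) {
                return probe;
            } else if (sorted[probe] < target) {
                low = probe + 1;   // target must be above the probe
            } else {
                high = probe - 1;  // target must be below the probe
            }
        }
        return -1; // target falls outside the remaining range, so it is not present
    }

    public static void main(String[] args) {
        int[] consistent = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
        System.out.println(estimateSearch(consistent, 9)); // prints 8 (found with a single probe)
        System.out.println(estimateSearch(consistent, 0)); // prints -1 (not present)
    }
}
```

An estimate of this kind works best when values are spread roughly evenly across their range, which is the case for the consistent arrays and approximately the case for the uniformly random arrays used here.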

Results

The data for linear search is shown on a separate graph because its range is far larger than that of the other searches.

In the Average Probes (Random) graph, the line for Binary is almost identical to the line for RanBin and is therefore not visible.

See Appendix C for tables of raw data.

Discussion

The results of my experiment show that my search, Estimate, was the most efficient search in terms of the number of probes it used, for both consistent and random data. My hypothesis was supported by my data; the search that estimates a value’s location before probing is the most efficient.

It is important to note that Estimate was the most efficient in terms of the number of probes it required, not necessarily the amount of time it took to find a value. For example, binary search used three lines of code to place a new probe, while Estimate used nine. Line count does not translate directly into time per probe, but the extra work each probe requires must be taken into account.

If applied to real-world datasets, Estimate could improve search efficiency, but it is unclear by how much. Even though Estimate used only a fraction of the probes that binary search did, it is quite possible that it would reduce the actual search time by only a few nanoseconds.

There are many ways this study of search efficiency could be furthered: efficiency could be measured by elapsed time rather than probe count; the searches could be tested on more varied, much larger, and real-world datasets; and Estimate could be improved to place its probes more accurately and efficiently. A rough timing harness along those lines is sketched below.
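As a starting point for measuring time rather than probes, the harness below times java.util.Arrays.binarySearch as a stand-in; any of the six searches could be dropped in its place. This is only a rough sketch: JVM warm-up and other measurement effects are ignored, so the printed averages are approximate.

```java
// A rough sketch of measuring search time instead of counting probes.
// java.util.Arrays.binarySearch is used as a stand-in; any of the six searches could be substituted.
import java.util.Arrays;

public class TimingSketch {
    public static void main(String[] args) {
        int n = 100_000;
        int[] data = new int[n];
        for (int i = 0; i < n; i++) {
            data[i] = i + 1; // a "consistent" array: values 1 through n in order
        }

        int runs = 1_000_000;
        long sink = 0; // accumulate results so the work cannot be optimized away
        long start = System.nanoTime();
        for (int r = 0; r < runs; r++) {
            sink += Arrays.binarySearch(data, (r % n) + 1); // search for a rotating target
        }
        long elapsed = System.nanoTime() - start;
        System.out.println("Average time per search: " + (elapsed / (double) runs) + " ns (sink " + sink + ")");
    }
}
```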

Conclusion

The purpose of my study was to determine which search used the fewest probes to find a desired value. My hypothesis was strongly supported by my data; Estimate was the most efficient search.

References

  1. Cook, C. E. (2005). Blue Pelican Java. College Station: Virtualbookworm.com.
  2. Rowell, E. (n.d.). Know Thy Complexities! Retrieved June 2016, from http://bigocheatsheet.com/
  3. Stack Overflow. (n.d.). Retrieved 2016, from http://stackoverflow.com/

Acknowledgements

I am grateful to have had Dr. David Brown as a teacher, as he allowed me to realize my passion for computer science, and taught me just about all that I know about it. Mr. Carr taught me the importance of experimental design, and how to properly conduct research. Thanks to Mr. Kramer for mentoring me and for continuing to extend my computer science knowledge.

Appendix A

Big O Notation

“Big O notation” describes the efficiency of a computational function in terms of how the number of operations needed to reach its goal (sorting a list, finding a certain value, inserting a node, deleting a node, and so on) grows as the size of the input grows.

(Rowell, n.d.)
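As an illustration (not part of the original experiment), the snippet below prints the worst-case probe counts implied by O(n) and O(log n) at the array sizes used in this study; the base-2 logarithm figures line up closely with the measured binary search averages in Appendix C.

```java
// Worst-case probe counts implied by O(n) and O(log n) at the array sizes used in this study.
public class BigODemo {
    public static void main(String[] args) {
        int[] sizes = {100, 1_000, 10_000, 100_000};
        for (int n : sizes) {
            int linearWorstCase = n; // linear search may have to probe every element
            int binaryWorstCase = (int) Math.ceil(Math.log(n) / Math.log(2)); // roughly log2(n) probes
            System.out.println(n + " elements: linear search needs up to " + linearWorstCase
                    + " probes, binary search about " + binaryWorstCase + " probes");
        }
    }
}
```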

Appendix B

Table of All Datasets

|            | 100 elements | 1,000 elements | 10,000 elements | 100,000 elements |
| ---------- | ------------ | -------------- | --------------- | ---------------- |
| Consistent | A            | B              | C               | D                |
| Random     | E            | F              | G               | H                |

Table 1.

Table of All Datasets with Search Values

(Letters correspond to the datasets established in Table 1.)

| Dataset | 0% | 5% | 25% | 30% | 40% | 50% | 60% | 70% | 75% | 95% | 100% | 0 |
| ------- | -- | -- | --- | --- | --- | --- | --- | --- | --- | --- | ---- | --- |
| A | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 |
| B | B1 | B2 | B3 | B4 | B5 | B6 | B7 | B8 | B9 | B10 | B11 | B12 |
| C | C1 | C2 | C3 | C4 | C5 | C6 | C7 | C8 | C9 | C10 | C11 | C12 |
| D | D1 | D2 | D3 | D4 | D5 | D6 | D7 | D8 | D9 | D10 | D11 | D12 |

Table 2

| Dataset | 0% | 5% | 25% | 50% | 75% | 95% | 100% | 0 |
| ------- | -- | -- | --- | --- | --- | --- | ---- | --- |
| E | E1 | E2 | E3 | E4 | E5 | E6 | E7 | E8 |
| F | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 |
| G | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 |
| H | H1 | H2 | H3 | H4 | H5 | H6 | H7 | H8 |

Table 3

Each letter paired with a number represents one condition that must be tested by each program.

Appendix C

Raw Data

 

Key for the raw data below: each table covers one array size (given above the table), the top row lists the searched-for values, the left column lists the search used, and the rightmost column gives the average of the row.

Random

For all data from random arrays, each data point represents the average of 1000 runs using 1000 different random arrays. For each search value, every search used the same 1000 random arrays, but the 1000 arrays were different between search values. For RanBin and Random, there were 1000 runs on each of the 1000 arrays, and the average of all one million runs is shown.
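The array-generation code is not reproduced in this report; a minimal sketch of one plausible approach (n values drawn uniformly from 1 to n, then sorted so the searches can assume ordered data) is shown below. The class name and seed are illustrative.

```java
// A hedged sketch of generating one sorted random array with values in [1, n];
// the actual generation code used in the experiment is not shown in this report.
import java.util.Arrays;
import java.util.Random;

public class RandomArraySketch {
    public static int[] randomSortedArray(int n, Random rng) {
        int[] array = new int[n];
        for (int i = 0; i < n; i++) {
            array[i] = rng.nextInt(n) + 1; // uniform value between 1 and n
        }
        Arrays.sort(array); // the searches all assume ordered data
        return array;
    }

    public static void main(String[] args) {
        int[] sample = randomSortedArray(100, new Random(42));
        System.out.println(Arrays.toString(sample));
    }
}
```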

100 elements (values 1–100)

| Search | 1 | 5 | 25 | 50 | 75 | 95 | 100 | 0 | Average |
| ------ | - | - | -- | -- | -- | -- | --- | - | ------- |
| Linear | 36.2 | 40.921 | 49.885 | 67.036 | 84.905 | 97.42 | 101 | 101 | 72.29588 |
| Binary | 6.265 | 6.232 | 6.019 | 6.003 | 6.195 | 6.354 | 8 | 7 | 6.5085 |
| RanBin (1,000 avg) | 5.887728 | 5.995563 | 6.329254 | 6.524723 | 6.486642 | 6.038911 | 7.720701 | 6.728629 | 6.464019 |
| Random (1,000 avg) | 4.999793 | 6.527926 | 7.798838 | 8.093585 | 7.938542 | 6.510032 | 7.185795 | 5.711049 | 6.845695 |
| EstBin | 1 | 4.928 | 5.828 | 5.914 | 5.781 | 3.947 | 1 | 1 | 3.67475 |
| Est | 1 | 2.221 | 2.897 | 2.89 | 2.888 | 2.226 | 1 | 1 | 2.01525 |

1,000 elements (values 1–1,000)

| Search | 1 | 50 | 250 | 500 | 750 | 950 | 1000 | 0 | Average |
| ------ | - | -- | --- | --- | --- | --- | ---- | - | ------- |
| Linear | 370 | 403.779 | 527.347 | 685.394 | 837.754 | 968.305 | 1001 | 1001 | 724.3224 |
| Binary | 9.3 | 9.431 | 9.282 | 9.307 | 9.262 | 9.411 | 11 | 10 | 9.624125 |
| RanBin (1,000 avg) | 9.172718 | 9.300086 | 9.758573 | 9.977423 | 9.752822 | 9.225938 | 10.975175 | 9.975595 | 9.767291 |
| Random (1,000 avg) | 7.30911 | 11.130077 | 12.446807 | 12.759983 | 12.439991 | 11.045309 | 9.485426 | 8.051348 | 10.58351 |
| EstBin | 1 | 8.224 | 9.127 | 9.268 | 9.096 | 7.877 | 1 | 1 | 5.824 |
| Est | 1 | 3.254 | 3.527 | 3.604 | 3.506 | 3.169 | 1 | 1 | 2.5075 |

10,000 elements (values 1–10,000)

| Search | 1 | 500 | 2500 | 5000 | 7500 | 9500 | 10000 | 0 | Average |
| ------ | - | --- | ---- | ---- | ---- | ---- | ----- | - | ------- |
| Linear | 3691 | 4005.238 | 5177.492 | 6866.562 | 8458.405 | 9702.34 | 10001 | 10001 | 7237.88 |
| Binary | 13.095 | 12.752 | 12.783 | 12.871 | 12.746 | 12.795 | 15 | 14 | 13.25525 |
| RanBin (1,000 avg) | 12.538485 | 12.608171 | 13.100695 | 13.364917 | 13.171166 | 12.683484 | 14.364464 | 13.360973 | 13.14904 |
| Random (1,000 avg) | 9.611969 | 15.723728 | 17.058388 | 17.423006 | 17.131926 | 15.793367 | 11.788359 | 10.358783 | 14.36119 |
| EstBin | 1 | 11.389 | 12.528 | 12.784 | 12.531 | 11.409 | 1 | 1 | 7.955125 |
| Est | 1 | 3.699 | 4.054 | 4.075 | 3.946 | 3.722 | 1 | 1 | 2.812 |

100,000 elements (values 1–100,000)

| Search | 1 | 5000 | 25000 | 50000 | 75000 | 95000 | 100000 | 0 | Average |
| ------ | - | ---- | ----- | ----- | ----- | ----- | ------ | - | ------- |
| Linear | 36701 | 40434.275 | 53272.408 | 68094.365 | 84178.807 | 96945.439 | 100001 | 100001 | 72453.54 |
| Binary | 16.282 | 16.14 | 16.143 | 16.124 | 16.02 | 16.166 | 18 | 17 | 16.48438 |
| RanBin (1,000 avg) | 15.857128 | 15.908077 | 16.491563 | 16.624851 | 16.448232 | 15.964562 | 17.690557 | 16.689483 | 16.45931 |
| Random (1,000 avg) | 11.906298 | 20.293071 | 21.721988 | 21.967444 | 21.677897 | 20.359816 | 14.090079 | 12.664054 | 18.08508 |
| EstBin | 1 | 14.787 | 15.952 | 16.08 | 15.825 | 14.889 | 1 | 1 | 10.06663 |
| Est | 1 | 4.1 | 4.347 | 4.306 | 4.249 | 4.163 | 1 | 1 | 3.020625 |

Consistent

For Linear, Binary, EstBin, and Est, only 1 run was needed, because the array is always the same, and the program would always yield the same result. For RanBin and Random, the average of 1000 runs was taken.

100 elements (values 1–100)

| Search | 1 | 5 | 25 | 30 | 40 | 50 | 60 | 70 | 75 | 95 | 100 | 0 | Average |
| ------ | - | - | -- | -- | -- | -- | -- | -- | -- | -- | --- | - | ------- |
| Linear | 1 | 5 | 25 | 30 | 40 | 50 | 60 | 70 | 75 | 95 | 100 | 101 | 54.33333333 |
| Binary | 6 | 7 | 2 | 7 | 5 | 1 | 6 | 7 | 2 | 6 | 7 | 7 | 5.25 |
| RanBin (1,000 avg) | 5.709 | 5.515 | 6.25 | 5.906 | 6.192 | 6.333 | 6.074 | 6.167 | 6.289 | 5.533 | 6.712 | 6.666 | 6.112166667 |
| Random (1,000 avg) | 4.844 | 6.107 | 7.553 | 7.59 | 7.681 | 7.84 | 7.62 | 7.627 | 7.579 | 6.559 | 6.254 | 5.746 | 6.916666667 |
| EstBin | 1 | 1 | 1 | 1 | 1 | 1 | 6 | 1 | 1 | 1 | 1 | 1 | 1.416666667 |
| Est | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1.083333333 |

1,000 elements (values 1–1,000)

| Search | 1 | 50 | 250 | 300 | 400 | 500 | 600 | 700 | 750 | 950 | 1000 | 0 | Average |
| ------ | - | -- | --- | --- | --- | --- | --- | --- | --- | --- | ---- | - | ------- |
| Linear | 1 | 50 | 250 | 300 | 400 | 500 | 600 | 700 | 750 | 950 | 1000 | 1001 | 541.8333333 |
| Binary | 9 | 8 | 2 | 8 | 9 | 1 | 10 | 9 | 2 | 10 | 10 | 10 | 7.333333333 |
| RanBin (1,000 avg) | 8.962 | 8.936 | 9.428 | 9.445 | 9.64 | 9.524 | 9.611 | 9.461 | 9.378 | 8.77 | 9.996 | 9.997 | 9.429 |
| Random (1,000 avg) | 7.034 | 10.66 | 12.188 | 12.163 | 12.213 | 12.315 | 12.246 | 12.237 | 12.407 | 10.793 | 8.433 | 8.082 | 10.89758333 |
| EstBin | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Est | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |

10,000 elements (values 1–10,000)

| Search | 1 | 500 | 2500 | 3000 | 4000 | 5000 | 6000 | 7000 | 7500 | 9500 | 10000 | 0 | Average |
| ------ | - | --- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ----- | - | ------- |
| Linear | 1 | 500 | 2500 | 3000 | 4000 | 5000 | 6000 | 7000 | 7500 | 9500 | 10000 | 10001 | 5416.833333 |
| Binary | 13 | 13 | 2 | 13 | 12 | 1 | 11 | 14 | 2 | 14 | 14 | 14 | 10.25 |
| RanBin (1,000 avg) | 12.377 | 12.196 | 12.652 | 12.701 | 12.936 | 12.962 | 12.862 | 12.822 | 12.6 | 12.177 | 13.286 | 13.36 | 12.74425 |
| Random (1,000 avg) | 9.444 | 15.094 | 16.737 | 16.765 | 16.869 | 16.96 | 17.012 | 16.951 | 16.656 | 15.351 | 10.714 | 10.474 | 14.91891667 |
| EstBin | 1 | 14 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2.083333333 |
| Est | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1.083333333 |

100,000 elements (values 1–100,000)

| Search | 1 | 5000 | 25000 | 30000 | 40000 | 50000 | 60000 | 70000 | 75000 | 95000 | 100000 | 0 | Average |
| ------ | - | ---- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ------ | - | ------- |
| Linear | 1 | 5000 | 25000 | 30000 | 40000 | 50000 | 60000 | 70000 | 75000 | 95000 | 100000 | 100001 | 54166.83333 |
| Binary | 16 | 15 | 2 | 15 | 17 | 1 | 16 | 19 | 2 | 16 | 17 | 17 | 12.75 |
| RanBin (1,000 avg) | 15.754 | 15.595 | 16.042 | 16.075 | 16.135 | 16.218 | 16.254 | 18.901 | 16.015 | 15.543 | 16.679 | 16.756 | 16.33058333 |
| Random (1,000 avg) | 11.721 | 19.904 | 21.504 | 21.444 | 21.72 | 21.871 | 21.88 | 24.922 | 21.217 | 20.043 | 13.041 | 12.676 | 19.32858333 |
| EstBin | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Est | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
