eBay® Land Auctions - Who, Where and How Much?
 
Author

Steven Raddigan
American River College, Geography 350: Data Acquisition in GIS; Spring 2004
 
Abstract

This project focuses on the process of utilizing custom software to extract specific data elements from thousands of similar web pages over an extended period of time. The project involved database design, software application development and data integration with ArcMap 8.3. The eBay® internet website was chosen because of the dynamic nature of the site, the volume of data available for collection and the spatial nature of that data.
 
 
Introduction

Founded in September 1995, eBay® is The World's Online Marketplace™ for the sale of goods and services by a diverse community of individuals and businesses. (USA 2010) eBay® has approximately 30 main categories which are further divided into hundreds of sub-categories, each hosting thousands of auctions daily. The Real Estate category is divided into five sub-categories: Commercial, Land, Residential, Timeshares, and Other. This project will focus on the eBay® Land auctions. There are nearly 1,000 auctions active daily, with an additional 80-100 new items listed each day. Who is selling? Where are they selling? How many dollars are involved?

This project will develop a method to acquire the relevant data from the thousands of web pages that describe these auctions and spatially illustrate the results. The analysis will then continue down to the county level in order to determine which counties are the most prevalent in the two states having the highest number of auctions.

The goal is to develop a method that aids in the collection, organization, manipulation and analysis of the raw data from the eBay® internet site. The project will involve the development of a software application written in Visual FoxPro that will interface with the eBay® internet pages. ArcMap 8.3 will be used to spatially display the results.

The emphasis of this project is not on the results of the eBay® land auctions themselves, but on the process and development of the application that acquired the data.

 
Background

Almost everyone has at one time or another looked at the eBay® website. There is a large amount of information available for viewing that would be very time consuming to organize manually in order to study a particular item or category. eBay® is aware of the vast amount of information available and its usefulness to others, whether to researchers hoping to join the millions of eBay® sellers or to a very conservative buyer who wants to pay a fair price for an item, but the company closely guards that information.

Not many public or private entities offer access to their databases. Many websites display pages of pertinent information, but additional analysis or a different presentation of the data is often necessary to meet the requirements of a research project. Some organizations have a significant amount of relevant data for a specific research project, but the organization of the website pages and the need to load and read each page is time consuming. Depending on the depth and scope of the project, gathering the data using pen and paper can be almost prohibitive.

Can you ever have too much data? Only if the data is unorganized or irretrievable. This is when the researcher becomes overwhelmed and buried in piles of paper. I was not going to spend a couple of months printing out pages for the eBay® land auctions and then manually entering the information into a spreadsheet. I was not able to find a software application that could mine through web pages, extracting data and populating fields in a database. So, I created one that could.

 
Methods

The process began with a close study of the eBay® land auction pages. There were two different types of pages that needed to be reviewed: 1) the main listing page, which provides a list of all current auctions, and 2) the auction item page, which provides the specific auction item information.

From this review, a database was developed to serve as the collection center for the data elements. The auction item page held the specific information and led to the identification of several data elements:

STRUCTURE FOR AUCTION ITEM TABLE

Field Name     Field Type  Purpose                            Source
BEGSOURCE      Memo        Html Page Source (beginning)       Application
ENDSOURCE      Memo        Html Page Source (ending)          Application
ITEMID         Character   Unique Identity                    Ebay html source extraction
DESC           Character   Item Description - Line 1          Ebay html source extraction
DESC2          Character   Item Description - Line 2          Ebay html source extraction
STARTED        Character   Date auction started (ascii)       Ebay html source extraction
ENDS           Character   Date auction ended (ascii)         Ebay html source extraction
STARTDATE      DateTime    Date auction started               Application
ENDDATE        DateTime    Date auction ended                 Application
SELLERID       Character   Seller number                      Ebay html source extraction
SELLER         Character   Seller name                        Ebay html source extraction
HIGHBIDID      Character   High bidder id number              Ebay html source extraction
HIGHBIDDER     Character   High bidder name                   Ebay html source extraction
BIDAMT         Character   Bid amount                         Ebay html source extraction
LOANAMT        Numeric     Dollar value of loan assumption    Manual
STATE_NAME     Character   State                              Ebay html source extraction
CITY           Character   City                               Ebay html source extraction
COUNTY         Character   County                             Application/Manual
ZIPCODE        Character   Zip code                           Ebay html source extraction
PERIOD         Numeric     Number of auction days             Ebay html source extraction
BIDCNT         Numeric     Number of bids                     Ebay html source extraction
ADONLY         Logical     Advertisement flag                 Ebay html source extraction
DOWNPAYMENT    Logical     Downpayment flag                   Application/Manual
ASSUME         Character   String collection for review       Ebay html source extraction
RESERVE        Logical     Reserve met or not met             Ebay html source extraction
RELISTED       Logical     Indicates item was relisted        Ebay html source extraction
RELISTNUM      Character   References item relisted number    Ebay html source extraction
CNTYAPPROVED   Logical     County approval flag               Application/Manual
AMTAPPROVED    Logical     Loan amount approval flag          Application/Manual
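This structure translates directly into a Visual FoxPro table definition. The following is a minimal sketch of such a declaration, with field widths that are my own assumptions (the original listing does not record them) and only a representative subset of the fields shown:

* Sketch only: field widths are assumed and only a subset of fields
* is shown. M = memo, C(n) = character, T = datetime,
* N(n) = numeric, L = logical.
CREATE TABLE auction FREE ( ;
    begsource M, endsource M, ;
    itemid C(12), started C(30), ends C(30), ;
    startdate T, enddate T, ;
    sellerid C(20), seller C(40), bidamt C(15), ;
    state_name C(25), city C(30), county C(25), zipcode C(10), ;
    bidcnt N(5), adonly L, reserve L, relisted L)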

A graphical user interface was developed using Visual FoxPro. The design used object-oriented programming and was divided into four page frames to allow the greatest access for data manipulation.

The main listing pages were used to extract the auction item identification number for each active auction. The auction identification numbers are not visible on the main listing screen; they have to be extracted from the html source of the main listing page. The initial program function, acquiring each active auction item number, had to determine the total number of main listing pages and cycle through each one, as sketched below. A temporary table was used to collect these numbers each time the program was run. Each number was checked against an array containing all auction item numbers from the auction item table. Numbers found in both tables were ignored, while numbers not found in the auction item table were appended to it.
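The harvesting routine itself is not reproduced here. The following is a minimal sketch of one pass over a listing page, assuming the page is fetched with the MSXML2.XMLHTTP COM object and that each auction link carries its item number in an "item=" query parameter; both the fetch mechanism and the "item=" marker are assumptions, not confirmed details of the original application:

* Sketch: harvest item numbers from one main listing page.
* Assumptions: MSXML2.XMLHTTP is available, links carry "item=<digits>",
* and the auction table defined earlier is open.
LOCAL loHttp, lcUrl, lcHtml, lcItem, lnOcc, lnPos, lnChar
LOCAL laKnown[1]
lcUrl = "http://listings.ebay.com/land"    && placeholder address only
loHttp = Createobject("MSXML2.XMLHTTP")
loHttp.Open("GET", lcUrl, .F.)
loHttp.Send()
lcHtml = loHttp.responseText

* item numbers already collected in earlier runs
SELECT itemid FROM auction INTO ARRAY laKnown

lnOcc = 1
lnPos = At("item=", lcHtml, lnOcc)
Do While lnPos > 0
    * collect the digits that follow the marker
    lcItem = ""
    lnChar = lnPos + Len("item=")
    Do While Isdigit(Substr(lcHtml, lnChar, 1))
        lcItem = lcItem + Substr(lcHtml, lnChar, 1)
        lnChar = lnChar + 1
    Enddo
    * append only numbers not already in the auction item table
    If Ascan(laKnown, lcItem) = 0
        INSERT INTO auction (itemid) VALUES (lcItem)
    Endif
    lnOcc = lnOcc + 1
    lnPos = At("item=", lcHtml, lnOcc)
Enddo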

The next function required the application to acquire the beginning html source for the auction items. Once the beginning source was captured, additional functions extracted the beginning and ending dates. These dates needed to be converted to a DateTime type field for comparison with the computer date. This was necessary in order to search the eBay® database for the inactive auctions and extract the html source for those auctions.
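As a sketch of that conversion, assuming the scraped ascii date looked like "Feb-15-04 09:30:15 PST" (the exact format eBay® used is an assumption here):

* Convert the ascii end date into a DateTime value and flag
* auctions whose end has passed the computer clock.
LOCAL lcRaw, lnMonth, lnDay, lnYear, lnHH, lnMI, lnSS, ltEnd
lcRaw   = Alltrim(ends)    && e.g. "Feb-15-04 09:30:15 PST" (assumed format)
lnMonth = (At(Left(lcRaw, 3), "JanFebMarAprMayJunJulAugSepOctNovDec") + 2) / 3
lnDay   = Val(Substr(lcRaw, 5, 2))
lnYear  = 2000 + Val(Substr(lcRaw, 8, 2))
lnHH    = Val(Substr(lcRaw, 11, 2))
lnMI    = Val(Substr(lcRaw, 14, 2))
lnSS    = Val(Substr(lcRaw, 17, 2))
ltEnd   = Datetime(lnYear, lnMonth, lnDay, lnHH, lnMI, lnSS)
Replace enddate With ltEnd
If ltEnd < Datetime()
    * auction is over: fetch the closed page into endsource
Endif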

Once the ending html source code was acquired, the process of extracting the remaining data elements began. The majority of the data elements were located in a consistent manner. Below is a snippet of html source and the associated code necessary to extract the "Winning Bid Amount" data element.

HTML SOURCE

WINNING BID DATA ELEMENT EXTRACTION CODE
<table border="0" cellspacing="0" cellpadding="6">
<tr>
<td valign="top" nowrap>
<font face="Arial" size="2">
Winning bid:
</font>
</td>
<td width="100%">
<font face="Arial" size="2"><b>US $1,525.00<font face="Arial" size="2" color="#666666">
</td>

 

* Confirm the auction completed (a "Winning bidder" line is present),
* then pull the amount that follows the "Winning bid:" label.
aloc = At("Winning bidder", lchtml)
If aloc != 0
    aloc = At("Winning bid:", lchtml)
    * grab a window of source after the label, then isolate the amount
    amount = Substr(lchtml, aloc, 125)
    bamount = At("$", amount)    && start of the dollar figure
    eamount = At(".", amount)    && position of the decimal point
    * keep the digits after the "$" through the two decimal places
    Replace bidamt With Substr(amount, bamount+1, (eamount-bamount)+2)
Endif

Additional data elements were collected in a similar manner, but some required significant If-Then logic depending on the complexity of the data element and the options eBay® used to display the data.
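For example, the reserve flag depended on which of several phrases appeared in the source. A sketch of that branching, with the exact phrases assumed rather than taken from the original application:

* Branching extraction for the reserve flag; the phrases searched
* for here are assumptions about the page wording.
If At("Reserve not met", lchtml) > 0
    Replace reserve With .F.
Else
    If At("Reserve met", lchtml) > 0
        Replace reserve With .T.
    Endif
Endif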

The data collection process began February 10, 2004 and continued through April 18, 2004. There were 7430 records added to the auction item table. A set of criteria defined the acceptable dataset to be used for the analysis. The criteria were:

  • The auction item had to have been active on February 15, 2004
  • The auction had to have been completed by April 15, 2004
  • The location for the auction item had to be within the United States
  • Advertisements were not to be included
  • Incomplete auctions were not to be included (auctions with no bids or auctions deleted by the seller)
  • All duplicate auction items had to be excluded

After the criteria were applied, the remaining dataset contained 3686 records. An analysis of these records was conducted to determine the state in which each property was located. Once the data was assembled, determining which states had the highest number of auctions was straightforward, but not very challenging or rewarding, so I decided to look at the auctions within each state and determine where in the state they were occurring. The two states having the highest number of auctions were further analyzed to determine the counties involved. In addition, the dollar value of the auctions in those two states was determined, and the top sellers were identified from the entire dataset.

These analyses resulted in additional tables in the application. The state count table and the county count table were then joined to the state and county layers within ArcMap. The top sellers for the entire dataset were presented in a tabular format. The dollar value of the auctions occurring in the two states having the most auctions was presented on the state county map.
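The counts themselves reduce to simple grouping queries over the cleaned table. The following is a sketch of the kind of statements involved (table and field names follow the structure above; the dollar figure here ignores the down-payment and loan-assumption adjustments described later):

* Auctions per state, for joining to the ArcMap state layer.
SELECT state_name, COUNT(*) AS auctions ;
    FROM auction ;
    GROUP BY state_name ;
    ORDER BY 2 DESC ;
    INTO TABLE statecnt

* County counts and dollar totals for one of the two top states.
SELECT county, COUNT(*) AS auctions, ;
       SUM(Val(Chrtran(bidamt, ",", ""))) AS dollars ;
    FROM auction ;
    WHERE state_name = "Arkansas" ;
    GROUP BY county ;
    INTO TABLE countycnt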

 
Results
 
An analysis of the data revealed that Arkansas had the most land auction sales during the reporting period, followed by Texas. This information is presented spatially in the following three figures.

eBay® Land Auctions
February 15, 2004 to April 15, 2004

Arkansas Land Auctions By County
February 15, 2004 to April 15, 2004
Total: $919,869.65

Texas Land Auctions By County
February 15, 2004 to April 15, 2004
Total: $3,225,794.92
 


A study of the entire dataset determined the top three sellers, broken down further to the state level, as presented below:

TOP THREE SELLERS FOR THE UNITED STATES
February 15, 2004 to April 15, 2004

nationallandcompany - 245
Arkansas - 243
Missouri - 2

governmentauction - 181
North Dakota - 25
Arizona - 21
Arkansas - 21
California - 21
Colorado - 16
Nevada - 16
Texas - 16
Oregon - 10
Kansas - 5
Hawaii - 4
Florida - 3
Michigan - 2
South Dakota - 1

tmas1 - 143
Michigan - 88
Texas - 42
Arizona - 6
California - 4
Nebraska - 3

 

Analysis
 
The results are evidence that this process will work. A significant amount of data was collected, organized, analyzed and presented in an illustrative manner using the internet with custom and off-the-shelf software applications. I would like to report that it all went well without any problems, but that was not the case.

The custom application was written to find a particular phrase referencing the data element in the html source. The original program determined the number of pages to cycle through by locating "page" in the html source. During the data collection phase, eBay® changed the phrase to "Page", which resulted in the program processing only the first page. This was quickly caught because the number of records being added to the database was abnormal and a screen notation showed only one page being processed. Another problem occurred during the last two weeks of data collection, when duplicates started to appear in the database. Because only two more weeks of data collection remained, I allowed the duplicates to occur and deleted them later.
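A small hardening that would have survived the "page" to "Page" edit is to match case-insensitively. A sketch using ATC(), Visual FoxPro's case-insensitive counterpart of At():

* ATC() ignores case, so cosmetic edits such as "page" -> "Page"
* no longer break the page-count search.
lnPos = Atc("page", lchtml)
If lnPos > 0
    * ... parse the page count that follows the phrase
Endif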

After the count was in for the total number of auctions occurring within each state, I felt a little empty. The process worked, but one map of the United States was not very rewarding, and I felt it didn't demonstrate the full potential of this type of data acquisition process. I decided to go deeper into the data, down to the county level. This is where the challenge, and a very time consuming process, emerged. None of the html source pages had a standard from which to mine out the county name. I decided to write a small routine that would use the ESRI county table and search the html source for county names occurring within the identified state. The process would look for a match in a manner similar to that used for the other data fields, as sketched below. The problem was that the list was in alphabetical order and the first match ended the search for that record. States having a lake referenced in the html source as well as a Lake county within the state produced unreliable data. Other problems occurred when the html source described landmarks and points of interest that also corresponded with county names. The only way to ensure some integrity in the data was to review each html source for the correct county name. If no county name was in the html source, I needed to locate the city and then find the county it was in. I searched the census website but was unable to locate any table that listed states, counties and cities. I used three other web sites extensively for this process: www.city-data.com, www.google.com, and http://txdot.lib.utexas.edu/. When a location was acceptable and a group of records was identified, I was able to populate the county in batches.
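A minimal sketch of that county scan, assuming the ESRI county table is open under the alias counties with NAME and STATE_NAME fields (the actual field names, and the choice of endsource as the text searched, are assumptions); the first-match weakness described above is visible in the early Exit:

* Scan the ESRI county list for the already-identified state and
* accept the first county name found in the html source.
SELECT counties
Scan For Upper(Alltrim(counties.state_name)) == Upper(Alltrim(auction.state_name))
    If At(Upper(Alltrim(counties.name)), Upper(auction.endsource)) > 0
        Replace county With counties.name In auction
        Exit    && first match wins: the source of the "Lake" problem
    Endif
Endscan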

In order to get the correct dollar amount, I found it necessary to determine which auctions were only bids on a down payment and to search out the assumed loan amount. This was a more tedious task than determining the county name. Again, there were no standard search phrases for an assumed loan. Each html source was viewed individually and the data was manually entered into the appropriate field. To keep track of which items had been processed, I modified the original table structure to include the county and amount approval fields. These two fields allowed me to manage the data through this process and served as the indicator that all records were processed. Because of the extensive amount of time involved in determining the proper county and dollar value for each item, I limited the process to the two states having the highest number of auctions.

 
Conclusions
 
The possibility exists for other data collection projects to succeed using this method. This method has proven to work well for gathering multiple data elements from thousands of similar web pages in a short period, and also for remaining reliable over an extended period of time. The latter would require monitoring the web pages to make sure that the author did not change them to a degree that would cause the locate methods to fail.

After my experience with the county name and dollar value, I would recommend that a small subset of the data be used to develop the entire process from cradle to grave. This would identify any problems that would have to be addressed before the final analysis could occur. If I had done so myself, I would have been aware of the county and dollar data element collection issues sooner than two weeks before the due date. As for the county data collection, I would have had the time to develop a similar application that could have gone through the www.city-data.com web pages and created a table with attributes of states, counties and cities, as well as any other data I could extract from those pages.

An awareness of the data collection issue with the dollar value data element would have motivated me to process the data as it was collected and not find myself, as I did with the county name, up against the deadline.

 
References
 
ESRI State and County Shape Files
www.city-data.com