Shostack + Friends Blog Archive

 

AOL search records 'research'

Most readers will have read by now that America Online publicly released a large sample of search records.
From the README supplied with the data:

The data set includes {AnonID, Query, QueryTime, ItemRank, ClickURL}.
AnonID - an anonymous user ID number.
Query  - the query issued by the user, case shifted with
most punctuation removed.
QueryTime - the time at which the query was submitted for search.
ItemRank  - if the user clicked on a search result, the rank of the
item on which they clicked is listed.
ClickURL  - if the user clicked on a search result, the domain portion of
the URL in the clicked result is listed.
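For anyone wanting to poke at the data themselves, here is a minimal sketch of reading one record into the five README fields. The tab separator, the `SearchRecord` type, and the `parse_record` helper are assumptions for illustration; the README excerpt above does not state the delimiter.

```python
from typing import NamedTuple, Optional


class SearchRecord(NamedTuple):
    """One row of the data set, using the field names from the README."""
    anon_id: int
    query: str
    query_time: str
    item_rank: Optional[int]   # only present when the user clicked a result
    click_url: Optional[str]   # only present when the user clicked a result


def parse_record(line: str, sep: str = "\t") -> SearchRecord:
    # Assumption: one record per line, fields separated by `sep`.
    fields = line.rstrip("\n").split(sep)
    # Pad rows that have no click information so unpacking always works.
    fields += [""] * (5 - len(fields))
    anon_id, query, query_time, item_rank, click_url = fields[:5]
    return SearchRecord(
        anon_id=int(anon_id),
        query=query,
        query_time=query_time,
        item_rank=int(item_rank) if item_rank else None,
        click_url=click_url or None,
    )
```

Rows without a click simply come back with `item_rank` and `click_url` set to `None`.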

There are about 20 million queries according to AOL, from about 650 thousand sources.
Some fun facts:
260 records match the SSN regex

/(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]+?)(?!00)\d\d\3(?!0000)\d{4}/

.
A contributor to the interesting-people list reports somewhat fewer matches, but perhaps (s)he has a more discriminating regex, or cleaned the results.
Of the ‘SSN matches’, one also contained what appeared to be a person’s full name, address, date of birth, and driver’s license number (with state of issue).
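A rough sketch of how such a count could be reproduced in Python. It assumes the digit classes in the pattern are `\d` and that the second separator is a backreference (`\3`) to the first, so that "123-45-6789" matches but "123-45 6789" does not. The `count_ssn_like` helper is hypothetical, and the pattern has no word boundaries, so it will also fire inside longer digit strings.

```python
import re

# SSN-shaped pattern: area number not 000 and at most 772, group number
# not 00, serial not 0000, with the same separator used in both positions.
SSN_RE = re.compile(
    r"(?!000)([0-6]\d{2}|7([0-6]\d|7[012]))([ -]+?)(?!00)\d\d\3(?!0000)\d{4}"
)


def count_ssn_like(queries):
    """Count the queries containing at least one SSN-shaped substring."""
    return sum(1 for q in queries if SSN_RE.search(q))
```

A match only means the query is shaped like an SSN, which may be one reason different people report different counts.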
OTOH, an extremely primitive “credit card number” regex yielded only four hits. I’m having some issues with the Regexp::Common::CC Perl package, so I rolled my own regex and I know it is terrible.
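For comparison, here is one way a primitive card-number check might look. This is not the regex used above (which isn't shown); it is a sketch that assumes card numbers are 13 to 16 digits, optionally separated by spaces or hyphens, and it adds a Luhn checksum to weed out random digit runs. `CARD_RE`, `luhn_ok`, and `find_card_like` are names made up for illustration.

```python
import re

# 13 to 16 digits, optionally separated by single spaces or hyphens,
# and not embedded in a longer run of digits.
CARD_RE = re.compile(r"(?<!\d)\d(?:[ -]?\d){12,15}(?!\d)")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit, counting from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_like(query: str):
    """Return digit strings in the query that look like card numbers."""
    hits = []
    for m in CARD_RE.finditer(query):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits
```

The checksum step is the main design choice here: a bare digit-run regex would flag phone numbers, tracking numbers, and the like, while Luhn rejects roughly nine in ten random digit strings.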
