It's amazing just how much information about a person you can gather on IRC. In
fact, I really can't think of any other Internet resource that automatically
offers so much information about a person. From a simple IRC /whois you can
usually get someone's name and IP address. And from
the IP address, you can deduce their ISP and location. This is all assuming, of
course, that they include their name in their IRC client setup info and that
their hostname isn't cloaked. About a month ago I decided I wanted to see if I
could build a database of IRC users so I could search for local geeks. Why?
Well, I've met local geeks before and it's a lot of fun getting together with
people of like mind. Anyway, the following is the tale of how I went about this
endeavor.
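To make the /whois bit concrete, here's a minimal Python sketch of pulling the nick, username, hostname and real name out of the raw 311 (RPL_WHOISUSER) line a server sends back. The server name `irc.example.net` and the requesting nick `me` are made up for illustration, and this is not the code I actually used (my bots were in Perl).

```python
def parse_whois_user(line):
    """Parse a raw RPL_WHOISUSER (numeric 311) line into a dict, or return None."""
    line = line.rstrip("\r\n")
    # Drop the optional ":servername" prefix.
    if line.startswith(":"):
        _, _, line = line.partition(" ")
    # Everything after the first " :" is the trailing parameter (the real name).
    head, _, realname = line.partition(" :")
    parts = head.split()
    # Expected layout: 311 <requester> <nick> <user> <host> *
    if len(parts) < 5 or parts[0] != "311":
        return None
    return {
        "nick": parts[2],
        "username": parts[3],
        "hostname": parts[4],
        "ircname": realname,
    }


# Hypothetical raw line; the server name and requester nick are invented.
SAMPLE = ":irc.example.net 311 me buckminst furuba 24-116-157-215.cpe.cableone.net * :buckminst"
print(parse_whois_user(SAMPLE))
```

If the hostname is cloaked, the `hostname` field still parses fine, it just won't tell you anything useful.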
Networks and Channel Lists
It was my goal to get the largest number of IRC users possible to increase my
chances of finding people in my small region of the world. To do this, I wanted
to find only the most populated networks and channels and not bother with the
small ones. Also, the larger channels and networks would be less likely to
notice any silly little IRC bots I would create to gather my data, tee hee!
Yes, I was rather unsure that some parts of this project were 100% legal :-P
Anyway, I first went to places like searchirc.com and irc.netsplit.de to find out what the world's
most populated networks were, got the hostnames of about the top 30, and then
made a simple bot to visit each one and do a /list command. (I use the
POE::Component::IRC module in Perl to make my IRC bots, by the way. Yes, I'm a
lame script kiddie XD) Actually, before I made this bot, I was using a C
program by zer0python that performed the /list on a specified server, but I
decided to write my own for the sake of flexibility. The /list approach worked
well only about 50% of the time, since many networks truncate the output of
the /list command. Meh. I then recorded all channels with at least 30
users in a MySQL database so that I would have maximum flexibility for
manipulating the data. Perl was a wise choice of programming language for this
project since it involved extensive text parsing ;-)
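For the curious, the filtering step looks roughly like this. It's a Python sketch rather than my Perl, and it assumes you already have the raw 322 (RPL_LIST) lines from the server in a list. The server name, channel names and user counts in the sample are invented.

```python
def parse_list_reply(line):
    """Return (channel, user_count, topic) from a raw 322 line, or None."""
    line = line.rstrip("\r\n")
    # Drop the optional ":servername" prefix.
    if line.startswith(":"):
        _, _, line = line.partition(" ")
    # The topic is the trailing parameter after the first " :".
    head, _, topic = line.partition(" :")
    parts = head.split()
    # Expected layout: 322 <requester> <channel> <visible users>
    if len(parts) < 4 or parts[0] != "322" or not parts[3].isdigit():
        return None
    return parts[2], int(parts[3]), topic


def busy_channels(lines, min_users=30):
    """Keep channels with at least min_users users, biggest first."""
    found = []
    for line in lines:
        parsed = parse_list_reply(line)
        if parsed and parsed[1] >= min_users:
            found.append(parsed[:2])
    found.sort(key=lambda item: item[1], reverse=True)
    return found


# Hypothetical /list output; every value here is invented.
SAMPLE_LINES = [
    ":irc.example.net 322 me #tiny 3 :nobody home",
    ":irc.example.net 322 me #linux 30 :penguins",
    ":irc.example.net 322 me #perl 412 :",
    ":irc.example.net 323 me :End of /LIST",
]
print(busy_channels(SAMPLE_LINES))
```

Note that this can't do anything about networks that truncate the list; whatever the server leaves out simply never shows up.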
Venturing Into The Channels
I then wrote a bot for simply connecting to a specified network, joining a
specified channel, performing a /who, sleeping for about 2 minutes, then
disconnecting. After doing a few trial runs of this maneuver, I found out that
most of the networks I was connecting to cloaked the hostnames of all the
users. Drat. I needed to have complete hostnames or IP addresses of users
or else this whole project would be rather pointless. So I set out to find out
which of the 46 networks I had gathered didn't mess with the users' hostnames
and there were only 5:
- irc.quakenet.org
- irc.undernet.org
- irc.efnet.org
- irc.freenode.net
- irc.deltaanime.net
At this point, I decided I'd better just find out the top 100 or so most populated channels of these networks, gather all those users, and go from there. So I created a script (netsplit_info.pl) to download the most highly populated channel listings for these networks from irc.netsplit.de and another script (netsplit_import.pl) to stash the channels in the database.
Lists O' Users
Next, I had to write a script to connect to a network and visit each of the channels for that network in the database. Thus I didst create *DUN DUN DUNNNN!!!* chan_spider.pl! Yes, I know, there's a lot of repetitious subroutines at the bottom of that program, but I wrote it that way because POE::Component would complain otherwise. Anyway, it's just a silly Perl script, relax. :-P It's supposed to be hideous. XD And here (link removed) is the script for parsing and dumping the output of the channel spider into the database. Pretty nifty, aye?
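Since the scripts themselves aren't shown here, this is a rough Python sketch of what the parsing step boils down to: turning raw 352 (RPL_WHOREPLY) lines from a /who into records, then collapsing people seen in more than one channel. It is not a port of my Perl. The server name and channel names in the sample are invented, and treating "same username plus same hostname" as "same person" is an assumption of this sketch.

```python
def parse_who_reply(line):
    """Parse a raw 352 line into a dict, or return None for anything else."""
    line = line.rstrip("\r\n")
    # Drop the optional ":servername" prefix.
    if line.startswith(":"):
        _, _, line = line.partition(" ")
    head, _, trailing = line.partition(" :")
    parts = head.split()
    # Expected: 352 <requester> <channel> <user> <host> <server> <nick> <flags>
    if len(parts) < 8 or parts[0] != "352":
        return None
    # The trailing parameter is "<hopcount> <real name>".
    _, _, ircname = trailing.partition(" ")
    return {
        "channel": parts[2],
        "username": parts[3],
        "hostname": parts[4],
        "nick": parts[6],
        "ircname": ircname,
    }


def unique_users(records):
    """Keep the first record seen for each (username, hostname) pair."""
    seen = {}
    for rec in records:
        key = (rec["username"].lower(), rec["hostname"].lower())
        seen.setdefault(key, rec)
    return list(seen.values())


# Hypothetical /who output; server and channel names are invented.
SAMPLE_LINES = [
    ":irc.example.net 352 me #perl furuba 24-116-157-215.cpe.cableone.net irc.example.net buckminst H :0 buckminst",
    ":irc.example.net 352 me #linux furuba 24-116-157-215.cpe.cableone.net irc.example.net buckminst H :0 buckminst",
    ":irc.example.net 352 me #linux tlp 168-103-130-145.bois.qwest.net irc.example.net tlp H@ :0 tlp",
]
records = [r for r in map(parse_who_reply, SAMPLE_LINES) if r]
print(len(records), "rows,", len(unique_users(records)), "distinct users")
```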
After spidering freenode, I had about 18,000 user records in my database. But
after using the SQL DISTINCT keyword, I found that I only had around 4,000
distinct users. This is partially because A) most users are in multiple
channels and B) my channel spider script was quite buggy and did stupid things
like performing a /who multiple times in the same channel and other
bizarrities. I'm not sure whether this was a bug in
POE::Component::IRC or my code. If you have a comment on this, feel free to
give me a holler. :-P So, to find out whether or not these people lived in my
vicinity, I found out the netranges of local ISPs in my area using the whois
program, which is available for all *nix operating systems.
For those of you who aren't aware, when you give whois an IP address, it will
look it up in one of the four Network Information Centers (NICs): APNIC, ARIN,
LACNIC and RIPE NCC. The whois program then returns information about what ISP
owns the IP address, the netrange that the IP address belongs to, and some
other information.
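Here's a small Python sketch of fishing the netrange out of whois output and testing addresses against it. It assumes the range shows up on a `NetRange:` line (as ARIN prints it) or an `inetnum:` line (as RIPE prints it); other registries may format things differently. The sample whois text is a made-up excerpt, not real output, though the range in it is the Pocatello one I ended up searching.

```python
import ipaddress
import re

# Matches lines like "NetRange:   a.b.c.d - e.f.g.h" or "inetnum: a.b.c.d - e.f.g.h".
RANGE_RE = re.compile(
    r"^[ \t]*(?:NetRange|inetnum):[ \t]*"
    r"(\d{1,3}(?:\.\d{1,3}){3})[ \t]*-[ \t]*(\d{1,3}(?:\.\d{1,3}){3})[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def netranges(whois_text):
    """Return a list of (first, last) IPv4Address pairs found in whois output."""
    return [
        (ipaddress.IPv4Address(first), ipaddress.IPv4Address(last))
        for first, last in RANGE_RE.findall(whois_text)
    ]


def in_any_range(ip, ranges):
    """True if the dotted-quad string ip falls inside any (first, last) pair."""
    addr = ipaddress.IPv4Address(ip)
    return any(first <= addr <= last for first, last in ranges)


# Invented excerpt for illustration; real whois output has many more fields.
SAMPLE_WHOIS = """\
NetRange:       24.116.152.0 - 24.116.159.255
OrgName:        Example Cable ISP
"""

ranges = netranges(SAMPLE_WHOIS)
print(in_any_range("24.116.157.215", ranges))
print(in_any_range("168.103.130.145", ranges))
```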
However, most of the user records in the database had hostnames that needed to be resolved into IP addresses before I could do an IP address netrange search. So I created a program to do that :-)
DNS Mayhem
So, I had approximately 3,000 DNS lookups to do. I couldn't do this with my
ISP's DNS servers since they might notice and shut off my net connection and
that would be a bad thing ;-p Many ISPs leave their DNS servers open to anyone
who happens to want to use them to do a DNS resolution, so I compiled a list of
several of these servers I found in the Seattle and Denver areas. 64 of them,
to be exact. I just googled for Seattle and Denver ISPs since those cities are
bound to have some pretty large ISPs and probably wouldn't notice a few hundred
DNS lookups :-P I found out their nameservers by, again, using the oh-so-handy
whois program.
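The idea of spreading lookups across a pool of servers can be sketched in a few lines of Python. This is only an illustration of the round-robin part, not what dns_lookup.pl does internally. The `lookup(hostname, nameserver)` callable is hypothetical: you'd have to supply something that actually queries the given server (the standard library resolver can't be pointed at an arbitrary nameserver, so a third-party DNS library would be needed). The nameserver addresses below are documentation-range placeholders.

```python
import itertools


def assign_round_robin(hostnames, nameservers):
    """Pair each hostname with a nameserver, cycling through the pool in order."""
    if not nameservers:
        raise ValueError("need at least one nameserver")
    pool = itertools.cycle(nameservers)
    return [(host, next(pool)) for host in hostnames]


def resolve_all(hostnames, nameservers, lookup):
    """Resolve every hostname via lookup(host, server); failures map to None."""
    results = {}
    for host, server in assign_round_robin(hostnames, nameservers):
        try:
            results[host] = lookup(host, server)
        except OSError:
            # Treat resolution errors as "no address" and keep going.
            results[host] = None
    return results


# Placeholder nameservers from the 192.0.2.0/24 documentation range.
SERVERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
HOSTS = ["a.example.net", "b.example.net", "c.example.net", "d.example.net"]
print(assign_round_robin(HOSTS, SERVERS))
```

With 64 servers and roughly 3,000 hostnames, each server would see fewer than 50 queries this way.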
The Not-So-Exciting Conclusion
After a couple hours of dns_lookup.pl chugging along, it finished and I feverishly tried my first netrange, which was one of cableone.net's Pocatello netranges:
mysql> select * from users where ip between '24.116.152.0' and '24.116.159.255';
+-----+---------+----------+---------------------------------+-----------+-----------+----------------+
| id  | chan_id | username | hostname                        | nick      | ircname   | ip             |
+-----+---------+----------+---------------------------------+-----------+-----------+----------------+
| 523 |     211 | furuba   | 24-116-157-215.cpe.cableone.net | buckminst | buckminst | 24.116.157.215 |
+-----+---------+----------+---------------------------------+-----------+-----------+----------------+
1 row in set (0.03 sec)
Sigh. It's Bucky. Well, let's try another one! This time, let's try all the Qwest hostnames for Southeast Idaho! *types madly in MySQL!*
mysql> select * from users where hostname like '%bois.qwest.net';
+------+---------+----------+--------------------------------+------+---------+-----------------+
| id   | chan_id | username | hostname                       | nick | ircname | ip              |
+------+---------+----------+--------------------------------+------+---------+-----------------+
| 4587 |     305 | tlp      | 168-103-130-145.bois.qwest.net | tlp  | tlp     | 168.103.130.145 |
+------+---------+----------+--------------------------------+------+---------+-----------------+
1 row in set (0.02 sec)
ARGH! It's tlp! *groan* Why can't Sexy_Linux_ChiX0r be found?!?!?
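One caveat about the first query above: if the ip column is stored as text (which the quoted dotted-quad values suggest, though that's an assumption about the schema), then BETWEEN compares the addresses character by character rather than as numbers, and some addresses inside the netrange can get missed. The Python sketch below shows the difference using the same Pocatello range; the test address 24.116.159.3 is just an illustrative pick. MySQL's built-in INET_ATON() function does the same dotted-quad-to-integer conversion on the database side.

```python
import ipaddress

FIRST, LAST = "24.116.152.0", "24.116.159.255"


def ip_to_int(ip):
    """Convert a dotted-quad string to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(ip))


def in_range_numeric(ip, first=FIRST, last=LAST):
    """Range test on integer values: the comparison you actually want."""
    return ip_to_int(first) <= ip_to_int(ip) <= ip_to_int(last)


def in_range_text(ip, first=FIRST, last=LAST):
    """Range test on plain strings: roughly what a text comparison does."""
    return first <= ip <= last


for candidate in ("24.116.157.215", "24.116.159.3"):
    print(candidate, in_range_numeric(candidate), in_range_text(candidate))
```

The second address sits inside the netrange numerically, but as a string "24.116.159.3" sorts after "24.116.159.255" because "3" comes after "2", so the text comparison rejects it.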
Well, in conclusion, there are a lot of people on IRC.... but not THAT many. Searchirc.com records an average of a little over 1,000,000 users which, spread out over planet Earth, is actually pretty sparse. Ah, well, I learned a lot of nifty things with this project, even though it was a miserable failure. Ok, maybe not totally a failure... like, I can PM Bucky and tlp on IRC and say something like "w00t! my lil channel spider scriptie thingie found you!! ARRR HAR HAR HAR HAR!!!!". *ahem*