GeoIP Query and Convert

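Recently at work I needed to write a crawler to scrape data from other sites' web pages. With only a limited number of IPs, I had to resort to various obfuscation tricks to avoid being banned, one of which is forging the X-Forwarded-For request header. It is unlikely to help much (if the limit is enforced in Nginx, this header is simply ignored...), but every little bit counts. If you are going to do it, you may as well do it properly, so the forged IPs cannot be purely random; they have to belong to a specific geographic location. In the end I found that MaxMind, a foreign company, provides free GeoIP data. The download contains two files: one holds the IP blocks and the other the corresponding location information: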

GeoLiteCity-Blocks.csv
Copyright (c) 2011 MaxMind Inc.  All Rights Reserved.
startIpNum,endIpNum,locId
"16777216","16777471","17"
"16777472","16777727","104084"
"16777728","16778239","49"
"16778240","16778751","14409"
GeoLiteCity-Location.csv
Copyright (c) 2012 MaxMind LLC.  All Rights Reserved.
locId,country,region,city,postalCode,latitude,longitude,metroCode,areaCode
1,"O1","","","",0.0000,0.0000,,
2,"AP","","","",35.0000,105.0000,,

414266,"CN","30","Hegu","",22.5970,112.8044,,
414267,"CN","16","Pingchihsu","",22.1500,108.7500,,
414268,"CN","30","Dadu","",22.9785,113.1684,,
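
As a rough sketch of how these two files might be loaded, the following uses Python 3's csv module. It assumes the layout shown in the samples above (one copyright line, then a header row, then data rows); the names `load_blocks` and `load_locations` are made up for this illustration, and the inline strings stand in for the real files.

```python
import csv
import io


def _data_rows(lines):
    """Yield data rows as dicts, skipping the leading copyright line."""
    lines = iter(lines)
    next(lines)  # assumption: the first line is the copyright notice
    for row in csv.DictReader(lines):
        yield row


def load_blocks(lines):
    """Return a sorted list of (start_ip_num, end_ip_num, loc_id) tuples."""
    blocks = [
        (int(row["startIpNum"]), int(row["endIpNum"]), int(row["locId"]))
        for row in _data_rows(lines)
    ]
    blocks.sort()
    return blocks


def load_locations(lines):
    """Return a dict mapping loc_id to its location row."""
    return {int(row["locId"]): row for row in _data_rows(lines)}


# Small in-memory stand-ins for the two files, copied from the samples above.
sample_blocks = io.StringIO(
    'Copyright (c) 2011 MaxMind Inc.  All Rights Reserved.\n'
    'startIpNum,endIpNum,locId\n'
    '"16777216","16777471","17"\n'
    '"16777472","16777727","104084"\n'
)
sample_locations = io.StringIO(
    'Copyright (c) 2012 MaxMind LLC.  All Rights Reserved.\n'
    'locId,country,region,city,postalCode,latitude,longitude,metroCode,areaCode\n'
    '414266,"CN","30","Hegu","",22.5970,112.8044,,\n'
)

blocks = load_blocks(sample_blocks)
locations = load_locations(sample_locations)
print(blocks[0])
print(locations[414266]["city"])
```

With the real files, an open file object can be passed in place of the `io.StringIO` stand-ins, opened with whatever encoding your download uses.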

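The data is mostly self-explanatory. The one catch is that IP addresses are stored as integers, so they need to be converted back to the familiar dotted format: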

Long/IP Convert
import socket
import binascii


def ip2long(ip_addr):
    """
    Wrapper function for IPv4 and IPv6 converters
    @param ip_addr: IPv4 or IPv6 address
    @type ip_addr: str
    """

    try:
        # Try IPv4 first; inet_aton raises socket.error on invalid input.
        return int(binascii.hexlify(socket.inet_aton(ip_addr)), 16)
    except socket.error:
        # Fall back to IPv6.
        return int(binascii.hexlify(socket.inet_pton(socket.AF_INET6, ip_addr)), 16)


def long2ip(ip_num):
    """
    Wrapper function for IPv4 and IPv6 converters
    @param ip_num: Integer representation of an IP address
    @type ip_num: long
    """

    # Values that fit in 32 bits are treated as IPv4, larger ones as IPv6.
    # (Formatting a larger value with '%08x' can yield an odd-length string,
    # which unhexlify rejects with an error other than socket.error.)
    if ip_num <= 0xFFFFFFFF:
        return socket.inet_ntoa(binascii.unhexlify('%08x' % ip_num))
    return socket.inet_ntop(socket.AF_INET6, binascii.unhexlify('%032x' % ip_num))

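These two functions convert between the two representations of an IP address; see Wikipedia for the underlying principle. Take 3733117184L as an example (the L suffix is the Python 2 long literal):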

>>> long2ip(3733117184L)
'222.130.217.0'
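
The conversion itself is just base-256 arithmetic: each of the four octets is one digit of the number. A minimal IPv4-only sketch that does not use the socket module might look like this; the names `ipv4_to_int` and `int_to_ipv4` are hypothetical, and the code does no input validation.

```python
def ipv4_to_int(ip_addr):
    """Treat the four octets as digits of a base-256 number."""
    result = 0
    for octet in ip_addr.split("."):
        result = (result << 8) | int(octet)
    return result


def int_to_ipv4(ip_num):
    """Extract the four octets from most to least significant."""
    return ".".join(str((ip_num >> shift) & 0xFF) for shift in (24, 16, 8, 0))


# 222 * 256**3 + 130 * 256**2 + 217 * 256 + 0 == 3733117184
print(ipv4_to_int("222.130.217.0"))  # 3733117184
print(int_to_ipv4(3733117184))       # 222.130.217.0
```

On Python 3, the standard-library ipaddress module offers the same round trip via `int(ipaddress.ip_address("222.130.217.0"))` and `str(ipaddress.ip_address(3733117184))`.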

Conversely, you can look up the location of an IP address. Take Baidu's 115.239.210.27 as an example:

>>> ip2long('115.239.210.27')
1945096731

Looking up the corresponding location data shows that this address falls on the record 104117,"CN","02","Jinhua","",29.1068,119.6442,, which at a glance would be Jinhua, Zhejiang?
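
The lookup step can be sketched with a binary search over the block ranges. This assumes the ranges are sorted by start address and do not overlap, which the sample rows suggest but which has not been verified here for the full file; `find_loc_id` is a hypothetical helper name.

```python
import bisect


def find_loc_id(blocks, ip_num):
    """Return the locId whose [start, end] range contains ip_num, or None.

    blocks must be a list of (start_ip_num, end_ip_num, loc_id) tuples
    sorted by start_ip_num, with non-overlapping ranges.
    """
    starts = [start for start, _end, _loc_id in blocks]
    # Index of the last block whose start is <= ip_num.
    index = bisect.bisect_right(starts, ip_num) - 1
    if index < 0:
        return None
    start, end, loc_id = blocks[index]
    return loc_id if start <= ip_num <= end else None


# The four sample rows from GeoLiteCity-Blocks.csv shown earlier.
blocks = [
    (16777216, 16777471, 17),
    (16777472, 16777727, 104084),
    (16777728, 16778239, 49),
    (16778240, 16778751, 14409),
]
print(find_loc_id(blocks, 16777500))  # 104084
print(find_loc_id(blocks, 1))         # None
```

For repeated lookups against the full file it would be cheaper to build the `starts` list once rather than on every call. The returned locId is then the key into the rows of GeoLiteCity-Location.csv.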