Set-up
I am using Scrapy to scrape housing ads. Per ad I retrieve a postal code which consists of four numbers followed by 2 letters, e.g. 1053ZM.
I have a excel sheet linking districts to postal codes in the following way,
district postcode_min postcode_max
A 1011AB 1011BD
A 1011BG 1011CE
A 1011CH 1011CZ
So, the second row states that postcodes ranging from 1011AB, 1011AC,..., 1011AZ, 1011BA,...,1011BD belong to district A.
The actual list contains 1214 rows.
Problem
I'd like to match each ad with its respective district, using its postal code and the list.
I am not sure what would be the best way to do this, and how to do this.
I've come up with two different approaches:
- Create all postcodes between
postcode_minandpostcode_max, assign all postcodes and their respective districts to a dictionary to subsequently match using a loop.
I.e. create,
d = {'A': ['1011AB','1011AC',...,'1011BD',
'1011BG','1011BH',...,'1011CE',
'1011CH','1011CI',...,'1011CZ'],
'B': [...],
}
and then,
found = False
for distr in d.keys(): # loop over districts
for code in d[distr]: # loop over district's postal codes
if postal_code in code: # assign if ad's postal code in code
district = distr
found = True
break
else:
district = 'unknown'
if found:
break
- Make Python understand there is a range between the
postcode_minand thepostcode_max, assign ranges and their respective districts to a dictionary, and match using a loop.
I.e. something like,
d = {'A': [range(1011AB,1011BD), range(1011BG,1011CE),range(1011CH,1011CZ)],
'B': [...]
}
and then,
found = False
for distr in d.keys(): # loop over districts
for range in d[distr]: # loop over district's ranges
if postal_code in range: # assign if ad's postal code in range
district = distr
found = True
break
else:
district = 'unknown'
if found:
break
Issues
For approach 1:
- How do I create all the postal codes and assign them to a dictionary?
For approach 2:
I used range() for explanatory purpose but I know range() does not work like this.
- What do I need to effectively have a
range()as in the example above? - How do I correctly loop over these ranges?
I think my preference lies with approach 2, but I am happy to work with either one. Or with another solution if you have one.