You can use a set:
- Sets can only contain unique values
- I've used set(...) to be explicit, but set(...) can be replaced with {...}.
- This implementation passes a generator expression to set().
- Don't use a list comprehension inside (e.g. set([...])), because the intermediate list can use a lot of memory.
- word not in tickets causes NameError: name 'tickets' is not defined because tickets is only assigned after the list comprehension has finished, so from the perspective of the list comprehension, tickets does not exist yet.
- If you're not getting a NameError, it's because tickets already exists in memory (for example, from an earlier run in the same session), or because tickets is assigned somewhere in your code that isn't part of this example.
- Given the example code, if you clear the environment and run the code, you'll get the error.
- .match returns something like <re.Match object; span=(0, 9), match='PRJ1-2333'> or None.
  - Where match = jira_regex.match(t), if there's a match, get the value with match[0].
- word for line in f for word in line.split() if jira_regex.match(word) assumes that whenever jira_regex.match(word) isn't None, the match is equal to word. Based on the sample data, this is the case, but I don't know if that holds for the real data.
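The NameError described above can be reproduced without a file. This is only a minimal sketch: the words list and the name unique_tickets are made up for illustration, and the exact error message differs between Python versions.

```python
import re

jira_regex = re.compile(r"^[A-Z][A-Z0-9]+-[0-9]+")
words = ["PRJ1-2333", "PRJ1-2333", "notes"]  # hypothetical sample data

error = None
try:
    # unique_tickets is only bound once the comprehension has finished,
    # so looking it up from inside the comprehension fails.
    unique_tickets = [w for w in words
                      if jira_regex.match(w) and w not in unique_tickets]
except NameError as exc:
    error = exc
    print(type(exc).__name__, exc)

# A set removes duplicates on its own, so no membership check is needed.
unique_tickets = {w for w in words if jira_regex.match(w)}
print(unique_tickets)
```

The first attempt fails on the first matching word, because that is the first time the membership check is evaluated.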
import re

jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open('test.txt', 'r') as f:
    tickets = set(word for line in f for word in line.split() if jira_regex.match(word))
print(tickets)
{'MISC-5002', 'PRJ1-2333', 'PRJ2-2333'}
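To illustrate the caveat about the match not always being equal to word, here is a small sketch using a made-up line where the tickets are followed by punctuation. The line variable and its contents are assumptions, not part of the sample data.

```python
import re

jira_regex = re.compile(r"^[A-Z][A-Z0-9]+-[0-9]+")
line = "Fixed in PRJ1-2333, see also MISC-5002."  # hypothetical input

# Keeping the whole word also keeps the trailing punctuation,
# because .match only anchors at the start of the string.
kept_words = {w for w in line.split() if jira_regex.match(w)}

# Keeping match[0] stores only the part that the regex matched.
matches = (jira_regex.match(w) for w in line.split())
matched_text = {m[0] for m in matches if m}

print(kept_words)
print(matched_text)
```

If the real data can look like this, storing match[0] is probably the safer choice.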
Without .split():
- It seems as if line.split() is being used to get rid of the newline, which can be accomplished with line.strip(). This only works if each line contains a single ticket and nothing else, as in the sample data.
Option 1:
import re

jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open('test.txt', 'r') as f:
    # assumes .match will never be None; a line that doesn't match raises TypeError
    tickets = set(jira_regex.match(word.strip())[0] for word in f)
print(tickets)
{'MISC-5002', 'PRJ1-2333', 'PRJ2-2333'}
Option 2:
import re

jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
with open('test.txt', 'r') as f:
    # keeps the stripped line itself whenever the regex matches at its start
    tickets = set(word.strip() for word in f if jira_regex.match(word.strip()))
print(tickets)
{'MISC-5002', 'PRJ1-2333', 'PRJ2-2333'}
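The two options behave differently when a line doesn't match. As a sketch, assuming Python 3.8+ for the := operator and using io.StringIO with made-up contents in place of test.txt, Option 1 fails on such a line, while a variant that filters on the match and keeps match[0] does not:

```python
import io
import re

jira_regex = re.compile(r"^[A-Z][A-Z0-9]+-[0-9]+")
sample = "PRJ1-2333\nnot a ticket\nPRJ2-2333\nPRJ1-2333\n"  # hypothetical file contents

option_1_error = None
try:
    # Option 1 style: .match returns None for "not a ticket", and None[0] fails.
    tickets = {jira_regex.match(line.strip())[0] for line in io.StringIO(sample)}
except TypeError as exc:
    option_1_error = exc
    print(type(exc).__name__, exc)

# Match once per line, skip the lines without a match, and keep match[0].
tickets = {m[0] for line in io.StringIO(sample)
           if (m := jira_regex.match(line.strip()))}
print(tickets)
```

This also avoids calling .strip() and .match twice per line, as Option 2 does.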
For the code to be explicit:
import re

jira_regex = re.compile("^[A-Z][A-Z0-9]+-[0-9]+")
tickets = []
with open('test.txt', 'r') as f:
    for t in f:
        t = t.strip()  # remove whitespace, including the newline, from both ends
        match = jira_regex.match(t)  # assign the result of .match to a variable
        if match is not None:  # check if a match was found
            match = match[0]  # extract the match value; depending on the data, this may not be the same as 't'
            if match not in tickets:  # check if match is already in tickets
                tickets.append(match)  # if match is not in tickets, add it to tickets
print(tickets)
['PRJ1-2333', 'PRJ2-2333', 'MISC-5002']
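If the order of first appearance matters, as in the list version, but the repeated match not in tickets scan over the list is a concern for large files, one possible alternative is dict.fromkeys, since dicts keep insertion order in Python 3.7+. This is a sketch rather than a drop-in replacement; io.StringIO and the sample contents are assumptions standing in for test.txt.

```python
import io
import re

jira_regex = re.compile(r"^[A-Z][A-Z0-9]+-[0-9]+")
sample = "PRJ1-2333\nPRJ2-2333\nPRJ1-2333\nMISC-5002\n"  # hypothetical file contents

# Match each stripped line once, then drop the lines that didn't match.
matches = (jira_regex.match(line.strip()) for line in io.StringIO(sample))

# Dict keys are unique and keep insertion order, so duplicates are dropped
# while the order of first appearance is preserved.
tickets = list(dict.fromkeys(m[0] for m in matches if m))
print(tickets)  # ['PRJ1-2333', 'PRJ2-2333', 'MISC-5002']
```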