pyspark: keep a function in the lambda expression

Question

I have the following working code:

def replaceNone(row):
  myList = []
  row_len = len(row)
  for i in range(0, row_len):
    if row[i] is None:
      myList.append("")
    else:
      myList.append(row[i])
  return myList

rdd_out = rdd_in.map(lambda row : replaceNone(row))

Here each row is a pyspark.sql.Row (from pyspark.sql import Row).

However, this is lengthy and ugly. Is it possible to avoid defining the replaceNone function by writing everything in the lambda expression directly? Or at least to simplify replaceNone()? Thanks!

Generally I'd say it sounds like you want your [code reviewed](http://codereview.stackexchange.com), but this is easily solved with a simple [ternary expression](http://stackoverflow.com/questions/394809/does-python-have-a-ternary-conditional-operator) and a list comprehension. – Tadhg McDonald-Jensen, Jun 10 '16 at 20:57
... what does that have to do with anything I said or mentioned in the answer you have received? – Tadhg McDonald-Jensen, Jun 10 '16 at 21:06

Erik · Accepted Answer · 2016-06-10T20:57:01.447

I'm not sure what your goal is. It seems like you're just trying to replace all the None values in each row of rdd_in with empty strings, in which case you can use a list comprehension:

rdd_out = rdd_in.map(lambda row: [r if r is not None else "" for r in row])

The call to map applies the lambda to every row in rdd_in, and the list comprehension builds a new list from that row, replacing each None with an empty string.

This worked on a trivial example (I defined my own map function, since Python lists have no .map() method):

def map(l, f):
    return [f(r) for r in l]

l = [[1,None,2],[3,4,None],[None,5,6]]
l2 = map(l, lambda row: [i if i is not None else "" for i in row])

print(l2)
# [[1, '', 2], [3, 4, ''], ['', 5, 6]]
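
The same check also works with Python's built-in map, which takes the callable first and the iterable second. The following is only a sketch on plain lists, not on an RDD; the names rows and cleaned are made up for illustration:

```python
rows = [[1, None, 2], [3, 4, None], [None, 5, 6]]

# Built-in map: callable first, iterable second.
# In Python 3 it returns a lazy iterator, so list() is needed to materialize it.
cleaned = list(map(lambda row: ["" if v is None else v for v in row], rows))

print(cleaned)
```

The lambda is identical in spirit to the one passed to rdd_in.map; only the thing being mapped over differs.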

edited Jun 10 '16 at 20:57

answered Jun 10 '16 at 20:43

Erik


What do you mean "map is not defined for a list"? The built-in `map` takes the callable as the first argument and the sequence as the second; just swap the order of the arguments and you don't need to redefine it. – Tadhg McDonald-Jensen Jun 10 '16 at 20:59
It is a bit tricky as the Row element can not be reassigned, row[i] = "" won't work. – Edamame Jun 10 '16 at 21:02
@TadhgMcDonald-Jensen I forgot that there was a built-in general map. Since you can't call `[1,2,3].map()` I (stupidly) just made a new function. – Erik Jun 10 '16 at 21:03
@zephyr1999: how would I use your map function on rdd_in? Thanks! – Edamame Jun 10 '16 at 21:15
@Edamame you don't need to. I only used it because I didn't want to go through the effort of setting up an RDD for testing. You should be able to use the `rdd_out = ...` line of code in my answer in place of your entire code block. – Erik Jun 10 '16 at 21:18
@Edamame you may want to look into how [rows are stored in pyspark](https://spark.apache.org/docs/1.6.1/api/python/pyspark.sql.html#pyspark.sql.Row) for efficiency. I don't have much experience with that specifically, so unless the RDDs automatically convert lists, you may be losing some of the in-memory advantages by converting to a list, as my code does. – Erik Jun 12 '16 at 04:43
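
The point about `row[i] = ""` not working can be reproduced without Spark. pyspark.sql.Row is a tuple subclass, so the sketch below uses a namedtuple as a stand-in; the Person type and its fields are hypothetical, and this is not Spark's own class. It shows that item assignment fails and that a cleaned row has to be built as a new object:

```python
from collections import namedtuple

# Hypothetical stand-in for a Row: an immutable, tuple-based record.
Person = namedtuple("Person", ["name", "age", "city"])
row = Person("Ann", None, None)

# Tuples do not support item assignment, so in-place replacement fails.
try:
    row[1] = ""
    assignment_worked = True
except TypeError:
    assignment_worked = False

# Build a new record of the same type instead of mutating the old one.
cleaned = Person(*["" if v is None else v for v in row])

print(assignment_worked)
print(cleaned)
```

Rebuilding a record of the same type keeps the field names, whereas the list comprehension in the answer returns a plain list.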
