I believe you are not using a proper microbenchmark setup. You are comparing the warm-up of the bytecode instrumentation framework (ASM, which is used to generate the lambda bytecode at runtime) plus the lambda execution time against the execution time of the loop.
Check the answer to "Performance difference between Java 8 lambdas and anonymous inner classes" and the document linked from it. That document gives deep insight into the processing under the hood.
Edit: Here is a small snippet to demonstrate the above.
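import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.stream.Stream;

public class Warmup {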
    static int dummy;
    static void merge(String s) {
        dummy += s.length();
        dummy++;
        dummy -= s.length();
    }
    public static void main(String[] args) throws IOException {
        List<String> list1 = new ArrayList<>();
        Random rand = new Random(1);
        for (int i = 0; i < 100_000; i++) {
            list1.add(Long.toString(rand.nextLong()));
        }
        // this will bootstrap the bytecode instrumentation
        // Stream.of("foo".toCharArray()).forEach(System.out::println);
        long start = System.nanoTime();
        list1.forEach(data -> merge(data));
        long end = System.nanoTime();
        System.out.printf("duration: %d%n", end - start);
        System.out.println(dummy);
    }
}
If you run the code as it is posted, the printed duration (in nanoseconds) on my machine is
duration: 71694425
If you uncomment the line Stream.of(... (which is only there to use the bytecode instrumentation framework for the first time), the printed duration is
duration: 7516086
That is only around 10% of the duration of the initial run.
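The same effect can be made visible within a single run, without commenting anything in or out. The following is only a minimal sketch, not a benchmark. The class name LambdaWarmupDemo and the method measure() are made up for this illustration, and it assumes that the first lambda in measure() is the first one the JVM has to bootstrap. It times two separate lambda call sites one after the other; the lambda expressions are evaluated inside the timed regions so that the one-time bootstrap cost is included in the first measurement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class LambdaWarmupDemo {
    static int dummy;

    static void merge(String s) {
        dummy += s.length();
        dummy++;
        dummy -= s.length();
    }

    // Returns { duration of first lambda pass, duration of second lambda pass } in nanoseconds.
    static long[] measure() {
        List<String> list = new ArrayList<>();
        Random rand = new Random(1);
        for (int i = 0; i < 100_000; i++) {
            list.add(Long.toString(rand.nextLong()));
        }

        // First lambda: this measurement includes whatever one-time lambda
        // bootstrap work the JVM still has to do.
        long t0 = System.nanoTime();
        list.forEach(data -> merge(data));
        long t1 = System.nanoTime();

        // Second lambda at a different call site: the infrastructure is
        // already loaded at this point.
        list.forEach(data -> merge(data));
        long t2 = System.nanoTime();

        return new long[] { t1 - t0, t2 - t1 };
    }

    public static void main(String[] args) {
        long[] durations = measure();
        System.out.printf("first lambda:  %d ns%n", durations[0]);
        System.out.printf("second lambda: %d ns%n", durations[1]);
        System.out.println(dummy);
    }
}
```

Keep in mind that the second pass also profits from other warm-up effects (class loading, JIT compilation of forEach and merge), so the difference between the two numbers cannot be attributed to the lambda bootstrap alone.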
Note: Just to be explicit, don't use benchmarks like the one above. Have a look at JMH for this kind of measurement.