Gdzie praca?setOutputKeyClass i job.setOutputReduceClass odnosi się do?

Question

Gdzie praca?setOutputKeyClass i job.setOutputReduceClass odnosi się do?

Myślałem, że odnoszą się do reduktora, ale w moim programie Mam

public static class MyMapper extends Mapper< LongWritable, Text, Text, Text >

I

public static class MyReducer extends Reducer< Text, Text, NullWritable, Text >

Więc jeśli mam

job.setOutputKeyClass( NullWritable.class );

job.setOutputValueClass( Text.class );

Otrzymuję następujący wyjątek

Type mismatch in key from map: expected org.apache.hadoop.io.NullWritable, recieved org.apache.hadoop.io.Text

Ale jeśli mam

job.setOutputKeyClass( Text.class );

Nie ma żadnego problemu.

Czy sth jest źle z moim kodem lub dzieje się tak z powodu NullWritable lub sth else?

Również czy muszę używać job.setInputFormatClass i job.setOutputFormatClass? Ponieważ moje programy działają poprawnie bez nich.

14

java hadoop mapreduce

Author: Charles Menguy, 2013-01-09

Source

1 answers

score 29 · Accepted Answer

Wywołanie job.setOutputKeyClass( NullWritable.class ); ustawia oczekiwane typy jako wyjście zarówno z fazy map jak i reduce.

Jeśli twój maper emituje inne typy niż reduktor, możesz ustawić typy emitowane przez mapera za pomocą metod JobConf'S setMapOutputKeyClass() i setMapOutputValueClass(). Te domyślnie ustawiają typy wejściowe oczekiwane przez reduktor.

(źródło: Yahoo Developer Tutorial )

Jeśli chodzi o drugie pytanie, domyślnym InputFormat jest TextInputFormat. Traktuje każdą linię KAŻDEGO pliku wejściowego jako oddzielny rekord i nie wykonuje parsowania. Możesz wywołać te metody, jeśli chcesz przetworzyć dane wejściowe w innym formacie, oto kilka przykładów:

InputFormat             | Description                                      | Key                                      | Value
--------------------------------------------------------------------------------------------------------------------------------------------------------
TextInputFormat         | Default format; reads lines of text files        | The byte offset of the line              | The line contents
KeyValueInputFormat     | Parses lines into key, val pairs                 | Everything up to the first tab character | The remainder of the line
SequenceFileInputFormat | A Hadoop-specific high-performance binary format | user-defined                             | user-defined

Domyślną instancją OutputFormat jest TextOutputFormat, która zapisuje pary (klucz, wartość) w poszczególnych liniach pliku tekstowego. Niektóre przykłady poniżej:

OutputFormat             | Description
---------------------------------------------------------------------------------------------------------
TextOutputFormat         | Default; writes lines in "key \t value" form
SequenceFileOutputFormat | Writes binary files suitable for reading into subsequent MapReduce jobs
NullOutputFormat         | Disregards its inputs

(źródło: Other Yahoo Developer Tutorial )