How do I encode/decode Kafka messages with the Avro binary encoder?

Posted 2024-12-18 12:47:54

I'm trying to use Avro for messages being read from/written to Kafka. Does anyone have an example of using the Avro binary encoder to encode/decode data that will be put on a message queue?

I need the Avro part more than the Kafka part. Or, perhaps I should look at a different solution? Basically, I'm trying to find a more efficient solution to JSON with regards to space. Avro was just mentioned since it can be more compact than JSON.

Comments (5)

吃兔兔 2024-12-25 12:47:54

This is a basic example. I have not tried it with multiple partitions/topics.

//Sample producer code

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.*;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Properties;


public class ProducerTest {

    void producer(Schema schema) throws IOException {

        Properties props = new Properties();
        props.put("metadata.broker.list", "0:9092");
        props.put("serializer.class", "kafka.serializer.DefaultEncoder");
        props.put("request.required.acks", "1");
        ProducerConfig config = new ProducerConfig(props);
        Producer<String, byte[]> producer = new Producer<String, byte[]>(config);
        GenericRecord payload1 = new GenericData.Record(schema);
        //Step2 : Put data in that genericrecord object
        payload1.put("desc", "'testdata'");
        //payload1.put("name", "अasa");
        payload1.put("name", "dbevent1");
        payload1.put("id", 111);
        System.out.println("Original Message : "+ payload1);
        //Step3 : Serialize the object to a bytearray
        DatumWriter<GenericRecord> writer = new SpecificDatumWriter<GenericRecord>(schema);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(payload1, encoder);
        encoder.flush();
        out.close();

        byte[] serializedBytes = out.toByteArray();
        System.out.println("Sending message in bytes : " + serializedBytes);
        //String serializedHex = Hex.encodeHexString(serializedBytes);
        //System.out.println("Serialized Hex String : " + serializedHex);
        KeyedMessage<String, byte[]> message = new KeyedMessage<String, byte[]>("page_views", serializedBytes);
        producer.send(message);
        producer.close();

    }


    public static void main(String[] args) throws IOException, DecoderException {
        ProducerTest test = new ProducerTest();
        Schema schema = new Schema.Parser().parse(new File("src/test_schema.avsc"));
        test.producer(schema);
    }
}

//Sample consumer code

Part 1: Consumer group code, since you can have more than one consumer for multiple partitions/topics.

import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConsumerGroupExample {
   private final ConsumerConnector consumer;
   private final String topic;
   private ExecutorService executor;

   public ConsumerGroupExample(String a_zookeeper, String a_groupId, String a_topic){
      consumer = kafka.consumer.Consumer.createJavaConsumerConnector(
              createConsumerConfig(a_zookeeper, a_groupId));
      this.topic = a_topic;
   }

   private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId){
       Properties props = new Properties();
       props.put("zookeeper.connect", a_zookeeper);
       props.put("group.id", a_groupId);
       props.put("zookeeper.session.timeout.ms", "400");
       props.put("zookeeper.sync.time.ms", "200");
       props.put("auto.commit.interval.ms", "1000");

       return new ConsumerConfig(props);
   }

    public void shutdown(){
        if (consumer != null) consumer.shutdown();
        if (executor != null) {
            executor.shutdown();
            try {
                if (!executor.awaitTermination(5000, TimeUnit.MILLISECONDS)) {
                    System.out.println("Timed out waiting for consumer threads to shut down, exiting uncleanly");
                }
            } catch (InterruptedException e) {
                System.out.println("Interrupted during shutdown, exiting uncleanly");
            }
        }
    }


    public void run(int a_numThreads){
        //Make a map of topic as key and no. of threads for that topic
        Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
        topicCountMap.put(topic, new Integer(a_numThreads));
        //Create message streams for each topic
        Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
        List<KafkaStream<byte[], byte[]>> streams = consumerMap.get(topic);

        //initialize thread pool
        executor = Executors.newFixedThreadPool(a_numThreads);
        //start consuming from thread
        int threadNumber = 0;
        for (final KafkaStream stream : streams) {
            executor.submit(new ConsumerTest(stream, threadNumber));
            threadNumber++;
        }
    }
    public static void main(String[] args) {
        String zooKeeper = args[0];
        String groupId = args[1];
        String topic = args[2];
        int threads = Integer.parseInt(args[3]);

        ConsumerGroupExample example = new ConsumerGroupExample(zooKeeper, groupId, topic);
        example.run(threads);

        try {
            Thread.sleep(10000);
        } catch (InterruptedException ie) {

        }
        example.shutdown();
    }


}

Part 2: Individual consumer that actually consumes the messages.

import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.message.MessageAndMetadata;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.generic.IndexedRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.Decoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.commons.codec.binary.Hex;

import java.io.File;
import java.io.IOException;

public class ConsumerTest implements Runnable{

    private KafkaStream m_stream;
    private int m_threadNumber;

    public ConsumerTest(KafkaStream a_stream, int a_threadNumber) {
        m_threadNumber = a_threadNumber;
        m_stream = a_stream;
    }

    public void run(){
        ConsumerIterator<byte[], byte[]> it = m_stream.iterator();
        while(it.hasNext())
        {
            try {
                //System.out.println("Encoded Message received : " + message_received);
                //byte[] input = Hex.decodeHex(it.next().message().toString().toCharArray());
                //System.out.println("Deserializied Byte array : " + input);
                byte[] received_message = it.next().message();
                System.out.println(received_message);
                Schema schema = null;
                schema = new Schema.Parser().parse(new File("src/test_schema.avsc"));
                DatumReader<GenericRecord> reader = new SpecificDatumReader<GenericRecord>(schema);
                Decoder decoder = DecoderFactory.get().binaryDecoder(received_message, null);
                GenericRecord payload2 = null;
                payload2 = reader.read(null, decoder);
                System.out.println("Message received : " + payload2);
            }catch (Exception e) {
                e.printStackTrace();
                System.out.println(e);
            }
        }

    }


}

Test Avro schema:

{
    "namespace": "xyz.test",
     "type": "record",
     "name": "payload",
     "fields":[
         {
            "name": "name", "type": "string"
         },
         {
            "name": "id",  "type": ["int", "null"]
         },
         {
            "name": "desc", "type": ["string", "null"]
         }
     ]
}

Important things to note:

  1. You'll need the standard Kafka and Avro jars to run this code out of the box.

  2. It is very important to use props.put("serializer.class", "kafka.serializer.DefaultEncoder");
    don't use the StringEncoder, as that won't work if you are sending a byte array as the message.

  3. You can convert the byte[] to a hex string, send that, and on the consumer reconvert the hex string to a byte[] and then to the original message (see the sketch after this list).

  4. Run ZooKeeper and the broker as described here: http://kafka.apache.org/documentation.html#quickstart and create a topic called "page_views", or whatever you want.

  5. Run ProducerTest.java and then ConsumerGroupExample.java to see the Avro data being produced and consumed.
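
For note 3, a minimal sketch of that hex round trip, reusing the commons-codec Hex class the producer above already imports (the byte values here are only a stand-in for a real Avro-encoded payload):

import org.apache.commons.codec.DecoderException;
import org.apache.commons.codec.binary.Hex;

public class HexRoundTrip {
    public static void main(String[] args) throws DecoderException {
        byte[] serializedBytes = {0x10, 0x64, 0x62}; // stand-in for the Avro-encoded bytes

        // Producer side: turn the Avro bytes into a printable hex string before sending.
        String serializedHex = Hex.encodeHexString(serializedBytes);
        System.out.println("Serialized hex string : " + serializedHex);

        // Consumer side: convert the hex string back to the original byte[] before Avro decoding.
        byte[] recovered = Hex.decodeHex(serializedHex.toCharArray());
        System.out.println("Recovered " + recovered.length + " bytes");
    }
}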

烂柯人 2024-12-25 12:47:54

I finally remembered to ask the Kafka mailing list and got the following as an answer, which worked perfectly.

Yes, you can send messages as byte arrays. If you look at the constructor
of the Message class, you will see -

def this(bytes: Array[Byte])

Now, looking at the Producer send() API -

def send(producerData: ProducerData[K,V]*)

You can set V to be of type Message and K to what you want your key to be.
If you don't care about partitioning using a key, then set that to Message
type as well.

Thanks,
Neha

少跟Wǒ拽 2024-12-25 12:47:54

If you want to get a byte array from an Avro message (the kafka part is already answered), use the binary encoder:

    GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<GenericRecord>(schema); 
    ByteArrayOutputStream os = new ByteArrayOutputStream(); 
    try {
        Encoder e = EncoderFactory.get().binaryEncoder(os, null); 
        writer.write(record, e); 
        e.flush(); 
        byte[] byteData = os.toByteArray(); 
    } finally {
        os.close(); 
    }
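
And to get the record back out of that byte array, a matching sketch with the binary decoder (assuming the same schema instance and that byteData holds the bytes produced above):

    DatumReader<GenericRecord> reader = new GenericDatumReader<GenericRecord>(schema);
    Decoder decoder = DecoderFactory.get().binaryDecoder(byteData, null);
    GenericRecord record = reader.read(null, decoder);
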
不再让梦枯萎 2024-12-25 12:47:54

Instead of Avro, you could also simply consider compressing the data, either with gzip (good compression, higher CPU cost) or with LZF or Snappy (much faster, slightly less compression).
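
A minimal sketch of the gzip route using only java.util.zip, assuming the payload is JSON bytes you already have (the jsonBytes value here is just an illustrative placeholder):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipJsonSketch {
    public static void main(String[] args) throws IOException {
        byte[] jsonBytes = "{\"name\":\"dbevent1\",\"id\":111,\"desc\":\"testdata\"}".getBytes(StandardCharsets.UTF_8);

        // Compress the JSON payload before handing it to the producer as a byte[].
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(bos);
        gzip.write(jsonBytes);
        gzip.close(); // closing writes the gzip trailer and flushes everything into bos
        byte[] compressed = bos.toByteArray();

        // Note: very small payloads like this one may not shrink, because of the fixed gzip header overhead.
        System.out.println(jsonBytes.length + " bytes of JSON -> " + compressed.length + " gzipped bytes");
    }
}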

Or alternatively there is also Smile binary JSON, supported in Java by Jackson (with the jackson-dataformat-smile extension): it is a compact binary format, and much easier to use than Avro:

ObjectMapper mapper = new ObjectMapper(new SmileFactory());
byte[] serialized = mapper.writeValueAsBytes(pojo);
// or back
SomeType pojo = mapper.readValue(serialized, SomeType.class);

Basically the same code as with JSON, except for passing a different format factory.
From a data-size perspective, whether Smile or Avro is more compact depends on the details of the use case; but both are more compact than JSON.

The benefit is that this works fast with both JSON and Smile, with the same code, using just POJOs, compared to Avro, which either requires code generation or lots of manual code to pack and unpack GenericRecords.

悲欢浪云 2024-12-25 12:47:54

Updated Answer.

Confluent publishes an Avro serializer/deserializer for Kafka with the following Maven (SBT-formatted) coordinates:

  "io.confluent" % "kafka-avro-serializer" % "3.0.0"

You pass an instance of KafkaAvroSerializer into the KafkaProducer constructor.

Then you can create Avro GenericRecord instances, and use those as values inside Kafka ProducerRecord instances which you can send with KafkaProducer.

On the Kafka consumer side, you use KafkaAvroDeserializer and KafkaConsumer.
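
A minimal sketch of that producer side, configuring the serializer through properties rather than passing it to the constructor; the broker address, the Schema Registry URL (http://localhost:8081), and the reuse of src/test_schema.avsc from the first answer are all assumptions for illustration:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ConfluentAvroProducerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");          // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081"); // assumed Schema Registry address

        Schema schema = new Schema.Parser().parse(new java.io.File("src/test_schema.avsc"));
        GenericRecord payload = new GenericData.Record(schema);
        payload.put("name", "dbevent1");
        payload.put("id", 111);
        payload.put("desc", "testdata");

        // KafkaAvroSerializer registers the schema with the registry and writes the record in Avro binary form.
        KafkaProducer<String, GenericRecord> producer = new KafkaProducer<String, GenericRecord>(props);
        producer.send(new ProducerRecord<String, GenericRecord>("page_views", payload));
        producer.close();
    }
}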
