分享

Storm【Storm0.9.3】官方翻译 2: Serialization

xioaxu790 发表于 2014-11-11 20:03:41 [显示全部楼层] 回帖奖励 阅读模式 关闭右栏 0 15977
本帖最后由 xioaxu790 于 2014-11-11 20:04 编辑
问题导读
1、为什么在工业的生产环境之中使用自定义的序列化?
2、什么是Kryo ?





接上一篇:Storm 【最新版 0.9.3】官方翻译 1: Fault-tolerance
This page is about how the serialization system in Storm works for versions 0.6.0 and onwards. Storm used a different serialization system prior to 0.6.0 which is documented on Serialization (prior to 0.6.0).
本章的描述适应于Storm0.6.0 以及以后的一些版本采取了其他的序列化系统。

Tuples can be compressed of objects of any types. Since Storm is a distributed system, it needs to know how to serialize and deserialize objects when they’re passed between tasks.
Tuples能偶被压缩成为任意类型的对象,做为分布式的消息处理系统,它需要知道怎么样在Task之间去序列化和反序列化对象。

Storm uses Kryo for serialization. Kryo is a flexible and fast serialization library that produces small serializations.
Storm使用了Kryo作为序列化的手段,Kryo 是一个快速高效的Java对象图形序列化框架,主要特点是性能、高效和易用。该项目用来序列化对象到文件、数据库或者网络。

By default, Storm can serialize primitive types, strings, byte arrays, ArrayList, HashMap, HashSet, and the Clojure collection types. If you want to use another type in your tuples, you’ll need to register a custom serializer.
在默认的情况之下,Storm能够序列化以下的基本类型:String,字节数组,ArrayList,HashMap,HashSet,还包括了一些Clojure的集合类型,如果你想在Tuple的传递过程之中使用其他的类型,那么你需要手工的去注册一个Serializer。


Dynamic typing
动态的类型


对于元祖中的Field,是没有任何的类型声明的,你只需要向里面填充,并且Storm会自动的解析并且动态的序列化,在我们拿到这个序列化的接口之前,让我们在理解为什么Storn‘s tuples是一个动态的类型之上,在展开来谈。

There are no type declarations for fields in a Tuple. You put objects in fields and Storm figures out the serialization dynamically. Before we get to the interface for serialization, let’s spend a moment understanding why Storm’s tuples are dynamically typed.

一旦为tule的fileds增加了静态的类型,将会给Storm的API带来更多的复杂性,静态的类型她自身的keys和Values将会要有有大量的Annotation

Adding static typing to tuple fields would add large amount of complexity to Storm’s API. Hadoop, for example, statically types its keys and values but requires a huge amount of annotations on the part of the user. Hadoop’s API is a burden to use and the “type safety” isn’t worth it. Dynamic typing is simply easier to use.

除此以外,无论在哪个方面还没有找到一个合适的理由是这么做。支持一个Bolt订阅多个Stream,这些从不同流过来的Strems也可能有不同的类型,每单一个Bolt接收到一个Tuple在Execute方法,这个Tuple可能来自于任意的流,应此也可能有任意任意的类型,这可能在通过在每一个tuple Stream 设置特别的Reflection magic来区分,但是最简单,最直观的办法就死一个动态Dynamic 的类型。


Further than that, it’s not possible to statically type Storm’s tuples in any reasonable way. Suppose a Bolt subscribes to multiple streams. The tuples from all those streams may have different types across the fields. When a Bolt receives a Tuple inexecute, that tuple could have come from any stream and so could have any combination of types. There might be some reflection magic you can do to declare a different method for every tuple stream a bolt subscribes to, but Storm opts for the simpler, straightforward approach of dynamic typing.

Finally, another reason for using dynamic typing is so Storm can be used in a straightforward manner from dynamically typed languages like Clojure and JRuby.


自定义序列化
Custom serialization


通常而言,Storm使用了Kryo的序列化,为了去实现自定义的序列化,你需要通过keyo去注册一个新的Serializer,如果有必要,请直接去访问Kryo'home page去明白怎杨去处理定制化。

As mentioned, Storm uses Kryo for serialization. To implement custom serializers, you need to register new serializers with Kryo. It’s highly recommended that you read over Kryo’s home page to understand how it handles custom serialization.


增加一个定制化的Serialzeers,你需要在你的topology配置之中新增。
Adding custom serializers is done through the “topology.kryo.register” property in your topology

他将花销一系列的注册器,每一个registraction能够采取以下的两种形式
config. It takes a list of registrations, where each registration can take one of two forms:

The name of a class to register. In this case, Storm will use Kryo’sFieldsSerializer to serialize the class. This may or may not be optimal for the class – see the Kryo docs for more details.


A map from the name of a class to register to an implementation ofcom.esotericsoftware.kryo.Serializer.

Let’s look at an example.

具体的释放如下

topology.kryo.register: - com.mycompany.CustomType1 - com.mycompany.CustomType2: com.mycompany.serializer.CustomType2Serializer - com.mycompany.CustomType3

com.mycompany.CustomType1 and com.mycompany.CustomType3 will use the FieldsSerializer, whereascom.mycompany.CustomType2 will use com.mycompany.serializer.CustomType2Serializer for serialization.



Storm也同样的提供了在topology Config之中注册序列化的helper,这个配置类有一个方法就叫registerSreializaton

Storm provides helpers for registering serializers in a topology config. The Configclasshas a method called registerSerialization that takes in a registration to add to the config.

相比之下,还会有更高级的Config配置被称为:Config.TOPOLOGY_SKIP_MISSING_KRYO_REGISTRATIONS.,如果你把他摄制为true,Storn将会自动的ignore任何已经注册但是布恩那个在ClassPath之中找到代码的序列化

There’s an advanced config called Config.TOPOLOGY_SKIP_MISSING_KRYO_REGISTRATIONS. If you set this to true, Storm will ignore any serializations that are registered but do not have their code available on the classpath.



//另外一方面,Storm将会抛出一个Error,每当你找不到这个Serialzation的时候,这是非常有必要的,如果你运许多topologirs在一个集群之中,并且每一个都有自身的序列化系统,那么你就必须在把每一个使用的序列化都配置到Storm.yarml文件之中

Otherwise, Storm will throw errors when it can’t find a serialization. This is useful if you run many topologies on a cluster that each have different serializations, but you want to declare all the serializations across all topologies in the storm.yaml files.

Java serialization

如果Storm遇到了一种类型,并且这个类型还没有被注册到Storm之中,他将会使用java的序列化,如果可能,如果连java的序列化都不能正常使用,那么Storm就会抛出一个异常。

If Storm encounters a type for which it doesn’t have a serialization registered, it will use Java serialization if possible. If the object can’t be serialized with Java serialization, then Storm will throw an error.


在这里必须清醒的意识到,java的序列化的开销是非常昂贵的。不管是在CPU的开销,还是序列化的对象的大小之上,强烈的建议你使用自定义的序列化在工业的生产环境之中

Beware that Java serialization is extremely expensive, both in terms of CPU cost as well as the size of the serialized object. It is highly recommended that you register custom serializers when you put the topology in production. The Java serialization behavior is there so that it’s easy to prototype new topologies.


You can turn off the behavior to fall back on Java serialization by setting theConfig.TOPOLOGY_FALL_BACK_ON_JAVA_SERIALIZATION config to false.

Component-specific serialization registrations

Storm 0.7.0 lets you set component-specific configurations (read more about this at Configuration). Of course, if one component defines a serialization that serialization will need to be available to other bolts – otherwise they won’t be able to receive messages from that component!

When a topology is submitted, a single set of serializations is chosen to be used by all components in the topology for sending messages. This is done by merging the component-specific serializer registrations with the regular set of serialization registrations. If two components define serializers for the same class, one of the serializers is chosen arbitrarily.

To force a serializer for a particular class if there’s a conflict between two component-specific registrations, just define the serializer you want to use in the topology-specific configuration. The topology-specific configuration has precedence over component-specific configurations for serialization registrations.


没找到任何评论,期待你打破沉寂

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

关闭

推荐上一条 /2 下一条