Get ID from slave server
Context
When processing large amounts of data on multiple distinct systems (for example when running a clustered pipeline) you sometimes still need to generate a unique numeric ID for each one. Since a GUID is not guaranteed to be unique and consumes a lot more memory and space compared to a numeric integer (long) ID, you may prefer to use numerical IDs.
A unique ID identifies a piece of information, which is useful for: looking up data, support cases, incremental processing, error handling (recovering from where you left off by looking up the last inserted ID) and so on. Typically you would use a database to generate such an ID, for example using an Oracle sequence. However, there are cases where you don’t have a database available (such as when you add a unique ID to a text/log file), when the database doesn’t support sequences (column databases, typically), or when retrieving the ID from a database is slow (going back and forth to the database for each row severely limits performance). In all these situations, you need a high-performance way of providing unique IDs to a Hop pipeline. The Get ID From Slave Server transform is designed to fit this need.
Assuming you have (or are now configuring) Hop Server slave servers set up in Hop, this transform retrieves a unique ID from the Hop Server slave server of your choice. It works by asking a slave server for a range of values. In other words, the transform reserves a range of IDs for a certain given central slave server sequence. It then increments by one until there are no more values in the range, asks another range, and so forth. This means that the returned IDs can be sequential, however this is not guaranteed. The only thing you know is that after each execution for each input row you get a unique ID that has a higher value than the last one returned. The last value returned plus the size of the range (increment) are stored in a database table to persist the value over several runs of a pipeline.
Options
Option | Description |
---|---|
Transform name | The name of this transform as it appears in the pipeline workspace. |
Name of value | The name of the (Integer type) output field (sequence or ID) |
Slave server | The hop server to get the unique ID range from. This can be specified using a variable |
Sequence name | The name of the sequence as defined on the hop server (see below). The sequence name can be specified using a variable |
Increment or batch name | The number of IDs to generate before a new value (range) is obtained from the hop server. The higher you make the increment size, the faster the transform can run. However, it will deplete the available IDs (1x10^15) faster. For example, if you take 1,000,000,000 as the increment, you can reserve 1,000,000 times a range of IDs. If you only use up a few of those ID each time a pipeline runs, you will run out of IDs anyway after 1M executions. Don’t make this value too high; keep it in the 1000-10000 range to achieve speeds of over 100,000 rows/sec with this transform. This parameter can be specified using a variable |
Slave server configuration
You need to start your hop server with extra connection and sequence information configured in an XML file. The extra connection and sequences blocks are what make the sequences work in the following example:
<hop-server-config>
<hop-server>
<name>master1</name>
<hostname>localhost</hostname>
<port>8282</port>
<master>Y</master>
</hop-server>
<connection>
<name>MySQL</name>
<server>localhost</server>
<type>MYSQL</type>
<access>Native</access>
<database>test</database>
<port>3306</port>
<username>matt</username>
<password>Encrypted 2be98afc86aa7f2e4cb79ce10df90acde</password>
</connection>
<sequences>
<sequence>
<name>test</name>
<start>0</start>
<connection>MySQL</connection>
<schema/>
<table>SEQ_TABLE</table>
<sequence_field>SEQ_NAME</sequence_field>
<value_field>SEQ_VALUE</value_field>
</sequence>
</sequences>
The start tag is optional and will default to 0 if you leave it out of the definition. You can define as many sequences as you like.
Servlet information
Once the configuration files are changed as shown above, slave servers receive a new servlet that can be called as follows (authentication required):http://hostname:port/hop/nextSequence/?name=SequenceName&increment=NumberOfIDsToReserveIn case no increment is specified, 10000 IDs are reserved, for example:http://localhost:8282/hop/nextSequence/?name=test The servlet will return a simple piece of XML containing both the start of the range as well as the number of IDs reserved, or the increment:
<seq><value>570000</value><increment>10000</increment></seq>
Continuing with this example, the following row will be present in the SEQ_TABLE table:
mysql> select * from SEQ_TABLE where SEQ_NAME='test';
SEQ_NAME | SEQ_VALUE |
---|---|
test | 580000 |
Automatic loading and creation
It can be a burden to maintain all your sequences in an XML file. Because of this, it is also possible to automatically load all the sequences from a database table. You can use the following construct to do it:
<autosequence>
<connection>MySQL</connection>
<schema/>
<start>1234</start>
<table>SEQ_TABLE</table>
<sequence_field>SEQ_NAME</sequence_field>
<value_field>SEQ_VALUE</value_field>
<autocreate>N</autocreate>
</autosequence>
The <autocreate> tag allows any sequence name to be specified in the transform without error. In that case, the sequence with the name specified will be created automatically with the start value from the <autosequence> specification.