PaperNotes/feature hashing.md at master · Oukaishen/PaperNotes

feature hashing

created by kaishen, 10 Jun

Also known as "hash trick". The idea behind this name simple: **convert the data into a vector of features. **

Similar methods are like "one-hot encode", "embedding".

Taking an example like this:

If there is a text "Hi, kaishen is using hashing trick" and we want to convert it to a vector.

First thing is to determine how large the dimension you want in the output vector.

In general, this number can be VERY large, up to 2**25. Let say we use 5 in this example.

Second we can use any good hash function that accept a word and output a number in [0,4].

Let say:

hash(Hi) = 0, hash(kaishen) = 1, hash(is) = 1, hash(using) = 2, hash(hashing) = 3, hash(trick) = 4,

Then the final vector is [1,2,1,1,1]

Notice that we just add 1 to the i-th dimension of ther vector each time our hash function returns that dimension. Sometimes you can use a 2nd hash function that return {-1, +1} to determine whether it is add or substract.

[Pro] There is one advantage for hash trick compared to the "one-hot" method. It is Friendly to online learning method where you can train on a dataset that doesn't fit in memory because you need to see each example only once. One-hot encoding will not work well with online learning beacuse to prepare dictionaries you need to see whole dataset first.

[Con] feature hashing does not have good interpretation.

See the following example:

fh = FeatureHasher(n_features = 5, input_type="string")
train_sample = np.array([["Hi","kaishen","is","using","hashing", "trick"]])
result = fh.transform(train_sample).toarray()
print(result)
## [[ 1.  0.  2. -2. -1.]]

part_sample = np.array([["Hi"]])
a1 = fh.transform(part_sample).toarray()
print(a1)
print()
part_sample = np.array([["kaishen"]])
a2 = fh.transform(part_sample).toarray()
print(a2)
print()
part_sample = np.array([["is"]])
a3 = fh.transform(part_sample).toarray()
print(a3)
print()
part_sample = np.array([["using"]])
a4 = fh.transform(part_sample).toarray()
print(a4)
print()
part_sample = np.array([["hashing"]])
a5 = fh.transform(part_sample).toarray()
print(a5)
print()
part_sample = np.array([["trick"]])
a6 = fh.transform(part_sample).toarray()
print(a6)
print()
result_add = a1+a2+a3+a4+a5+a6
print(result_add)
np.allclose(result, result_add)
######
[[0. 0. 1. 0. 0.]]
[[ 0.  0.  0. -1.  0.]]
[[0. 0. 1. 0. 0.]]
[[1. 0. 0. 0. 0.]]
[[ 0.  0.  0. -1.  0.]]
[[ 0.  0.  0.  0. -1.]]
[[ 1.  0.  2. -2. -1.]]
True

This result is in line with the disscussion above.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feature hashing

FilesExpand file tree

feature hashing.md

Latest commit

History

feature hashing.md

File metadata and controls

feature hashing