2021-01-11 04:34

embedding constraints

Does UMAP currently support specifying any embedding constraints?

In the past I've found it useful when doing dimensionality reduction to specify targets for a few key inputs. For example, when embedding RGB colors into 2D with UMAP, one can make the output more predictable by specifying that the darkest input go to the left and the lightest to the right. Currently I can get the same effect by rotating and scaling the UMAP embedding after the fact (which can be automated), but a more direct and flexible approach would be an API option to pin specific inputs to target destinations in the embedding.
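For illustration, that post-hoc fix-up can be automated roughly like this (a sketch with hypothetical names; `emb` is the 2D embedding, `dark_idx`/`light_idx` are the indices of the darkest and lightest inputs):

```python
import numpy as np

def align_embedding(emb, dark_idx, light_idx):
    """Rotate a centered 2D embedding so the darkest point ends up
    directly left of the lightest point along the x-axis.
    (Sketch only -- a reflection check could be added the same way.)"""
    v = emb[light_idx] - emb[dark_idx]
    theta = -np.arctan2(v[1], v[0])   # rotation angle that flattens v onto +x
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return (emb - emb.mean(axis=0)) @ R.T
```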




  • weixin_39746552 · 4 months ago

    Great to hear you're thinking along similar lines. No, I don't have any better implementation - I just was trying to communicate that use case more concretely.

    There is another use case I come across frequently: grid layout. Here's a prototypical use case: flag layout by color:


    My current codebase for this first generates a 2D umap scatterplot:


    But then it does a follow-up Linear Assignment Problem solver step to move points to the nearest grid location.


    Assuming umap doesn't currently have a means for doing discrete layouts like this, we could potentially discuss whether this use case would also be a possible form of an allowable embedding constraint.
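    For reference, the follow-up assignment step can be sketched with scipy's LAP solver (assumptions: a square `side` x `side` grid with one point per cell, `emb` being the 2D scatter from the previous step):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def snap_to_grid(emb, side):
    """Assign each embedded point a unique cell on a side x side grid,
    minimizing total squared movement (a Linear Assignment Problem)."""
    # rescale the scatter to span the grid extent
    emb = (emb - emb.min(axis=0)) / np.ptp(emb, axis=0) * (side - 1)
    gx, gy = np.meshgrid(np.arange(side), np.arange(side))
    grid = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
    # cost[i, j] = squared distance from point i to grid cell j
    cost = ((emb[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return grid[cols]   # one distinct grid cell per input point
```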

  • weixin_39710966 · 4 months ago

    Embedding-constraint-wise, that's a little more complicated to implement directly. It certainly isn't impossible, but it would require a different approach. I'll keep it in mind, however.

  • weixin_39746552 · 4 months ago

    Thanks Leland - encouraged that this idea is not DOA, let me motivate it a bit more by providing a sample use case, a proposed API, and a baseline implementation.

    One use case is when there really is an approximate low-dimensional embedding lurking in the higher-dimensional data. Here one could attempt to produce a more canonical view that could be compared across different datasets, etc. For example, suppose the data was approximately a 2D doughnut embedded in 3D space - in my motivating case I'll make this a (noisy) HSV colorwheel within an RGB cube:

        for n in range(n_samples):
            h = np.random.random()
            s = np.random.uniform(0.5, 1.0)
            v = np.random.uniform(0.8, 0.9)
            data[n] = colorsys.hsv_to_rgb(h, s, v)

    UMAP generally has no problem recovering the structure of this data:


    However, when running it multiple times with different seeds for generating the data the UMAP "view" of the colorwheel slice will be arbitrarily rotated, translated, and even flipped. To save space I'll animate results across 3 runs:


    One simple and flexible solution to constraining the embedding would be the ability to pass "pinned data" to UMAP.fit(), constraining certain inputs to be assigned fixed destination locations in the embedded space. In my example I'm going to find the most extreme points on each axis (i.e. the indices of the reddest, greenest, and bluest values) and assign them target points in a simple lookup table:

        pinned_data = [[max_red, [0, 3]], [max_green, [6, 10]], [max_blue, [10, 0]]]
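    (For context, the "most extreme points" are just the argmax indices along each RGB channel; a sketch with a hypothetical helper:)

```python
import numpy as np

def extreme_channel_indices(data):
    """Index of the row with the largest value in each RGB channel."""
    return [int(np.argmax(data[:, c])) for c in range(3)]

# hypothetical usage:
# max_red, max_green, max_blue = extreme_channel_indices(data)
```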

    So first here's a plot showing just these three pinned points:


    And if we pass these three destinations to UMAP.fit(), we get an embedding that is constrained not to move these three selected inputs:

        u = fit.fit_transform(data, pinned_data=pinned_data)

    The result is that even when we change the seed to generate three different random datasets, the results are much easier to compare against each other.


    You could even imagine aggregating these results across multiple data sources by combining the embedded outputs or even just overlaying the graphs directly with some transparency.

    Now my current "implementation" is super brain-dead. All I do is repeatedly force-move each pinned point in the embedded space each epoch before calling the optimize_fn.

        for n in range(n_epochs):
            if pinned_data is not None:
                # force each pinned point back to its target location
                for index, target in pinned_data:
                    head_embedding[index] = target
            # ...then call optimize_fn for this epoch as usual

    So this version currently only works by being generous with n_neighbors and n_epochs, and by cherry-picking a bit. But it communicates the idea - so if we are not put off too much by my API decisions around this and are willing to push this further, I can take a shot at generalizing it by also having the pinned_data influence the initial embedding, which I think would make the approach more robust.

  • weixin_39710966 · 4 months ago

    Thanks for the explanation. Your approach is not so far off from what I had in mind. I was actually thinking of making it a little more general and having a mask (boolean vector) of fixed points. Thus in the internal optimization we would simply check if a point is fixed or not before updating/moving it. If done right this should be relatively cheap and not slow things down noticeably. This opens up other possibilities as well, such as making the transform a little cleaner (all the training data would simply be declared fixed).

    The downside of this approach is that we would still need some level of better front-end API on top of that to facilitate your use case (since setting up an initialization and designated fixed points would be a little much to expect of end users in general).

    I would certainly be interested in seeing a PR if you feel you have something semi-ready to go of course.
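    A minimal sketch of that masked-update idea (hypothetical names; umap's actual optimizer is a compiled edge-wise loop, where the same boolean check would guard each per-point move):

```python
import numpy as np

def masked_step(embedding, grads, lr, pinned_mask):
    """One gradient step that skips pinned points.

    pinned_mask: boolean vector, True where a point is fixed in place.
    """
    free = ~pinned_mask
    embedding[free] -= lr * grads[free]   # only free points move
    return embedding
```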

  • weixin_39710966 · 4 months ago

    I don't know any way of doing that with the current code, but I agree that it certainly makes a lot of sense. I'll add this as a feature request, and hopefully I can add it as a feature in the near future.
