PHP / MySQL中的地理搜索(距离)(性能)

I have a MySQL-table (MyISAM) containing about 200k entries of lat/long pairs that I select from, based on the pairs distance (great circle formula) from another lat/long pair. (e.g. all entries that are within a 10km radius around 50.281852, 2.504883)

My problem is that this query takes about 0,28 sec. to run just for those 200k entries (which continue to get more every day). While 0,28 sec. would be fine normally, this query runs very often as it powers the main feature of my web-app, and often times it's part of a larger query.

Is there any way to speed this up? Obviosly MySQL has to run through all 200k entries every time and perform the great circle formula for every entry. I read something about geohashing, R-Trees and the like here on stackoverflow but I don't think that's the way I want to go. Partly because I've never been a big fan of maths, but mostly because I think that this problem has already been solved by someone smarter than me in a library/extension/etc. that has been tested extensively and is being updated regularly.

MySQL seems to have a spatial extension but that one doesn't provide a distance function. Should I be looking at another database to put this coordinate-pairs in? PostgreSQL seems to have a fairly mature Spatial extension. Do you know anything about it? Or would PostgreSQL too simply just use the great circle formula to get all entries within a certain region?

Is there maybe a specialized stand-alone product or mysql-extension that already does what I'm looking for?

Or is there maybe A PHP library I could use to do the calculations? Using APC I could easily fit the lat-long pairs into memory (those 200k entries take about 5MB) and then run the query inside of PHP. The problem with this approach however is that then I'd have a MySQL query like SELECT .. FROM .. WHERE id in (id1, id2, ..) for all the results wich can be up to a few thousand. How well does MySQL handle Queries like these? And then (since this is a number-crunching task) would doing this in PHP be fast enough?

Any other Ideas what I should/shouldn't do?

For completenes, here is the sample query, stripped of any irrelevant parts (as I said, usually this is part of a bigger query where I join multiple tables):

SELECT id, 6371 * acos( sin( radians( 52.4042924 ) ) * sin( radians( lat ) ) + cos( radians( 50.281852 ) ) * cos( radians( lat ) ) * cos( radians( 2.504883 ) - radians( lon ) ) ) AS dst
FROM geoloc
HAVING dst <10

Thank You!

doumeng1143 在半径(距离)内搜索仅10英里(15公里)时,你不能省略整个曲率方程并等于圆吗?
6 年多之前 回复


Calculate a bounding box to select a subset of the rows in the WHERE clause of your SQL query, so that you're only executing the expensive distance calculation on that subset of rows rather than against the entire 200k records in your table. The method is described in this article on Movable Type (with PHP code examples). Then you can include the Haversine calculation in your query against that subset to calculate the actual distances, and factor in the HAVING clause at that point.

It's the bounding box that helps your performance, because it means you're only doing the expensive distance calculation on a small subset of your data. This is effectively the same method that Patrick has suggested, but the Movable Type link has extensive explanations of the method, as well as PHP code that you can use to build the bounding box and your SQL query.


If you don't think haversine is accurate enough, then there's also the Vincenty formula.

//  Vincenty formula to calculate great circle distance between 2 locations expressed as Lat/Long in KM

function VincentyDistance($lat1,$lat2,$lon1,$lon2){
    $a = 6378137 - 21 * sin($lat1);
    $b = 6356752.3142;
    $f = 1/298.257223563;

    $p1_lat = $lat1/57.29577951;
    $p2_lat = $lat2/57.29577951;
    $p1_lon = $lon1/57.29577951;
    $p2_lon = $lon2/57.29577951;

    $L = $p2_lon - $p1_lon;

    $U1 = atan((1-$f) * tan($p1_lat));
    $U2 = atan((1-$f) * tan($p2_lat));

    $sinU1 = sin($U1);
    $cosU1 = cos($U1);
    $sinU2 = sin($U2);
    $cosU2 = cos($U2);

    $lambda = $L;
    $lambdaP = 2*M_PI;
    $iterLimit = 20;

    while(abs($lambda-$lambdaP) > 1e-12 && $iterLimit>0) {
        $sinLambda = sin($lambda);
        $cosLambda = cos($lambda);
        $sinSigma = sqrt(($cosU2*$sinLambda) * ($cosU2*$sinLambda) + ($cosU1*$sinU2-$sinU1*$cosU2*$cosLambda) * ($cosU1*$sinU2-$sinU1*$cosU2*$cosLambda));

        //if ($sinSigma==0){return 0;}  // co-incident points
        $cosSigma = $sinU1*$sinU2 + $cosU1*$cosU2*$cosLambda;
        $sigma = atan2($sinSigma, $cosSigma);
        $alpha = asin($cosU1 * $cosU2 * $sinLambda / $sinSigma);
        $cosSqAlpha = cos($alpha) * cos($alpha);
        $cos2SigmaM = $cosSigma - 2*$sinU1*$sinU2/$cosSqAlpha;
        $C = $f/16*$cosSqAlpha*(4+$f*(4-3*$cosSqAlpha));
        $lambdaP = $lambda;
        $lambda = $L + (1-$C) * $f * sin($alpha) * ($sigma + $C*$sinSigma*($cos2SigmaM+$C*$cosSigma*(-1+2*$cos2SigmaM*$cos2SigmaM)));

    $uSq = $cosSqAlpha*($a*$a-$b*$b)/($b*$b);
    $A = 1 + $uSq/16384*(4096+$uSq*(-768+$uSq*(320-175*$uSq)));
    $B = $uSq/1024 * (256+$uSq*(-128+$uSq*(74-47*$uSq)));

    $deltaSigma = $B*$sinSigma*($cos2SigmaM+$B/4*($cosSigma*(-1+2*$cos2SigmaM*$cos2SigmaM)- $B/6*$cos2SigmaM*(-3+4*$sinSigma*$sinSigma)*(-3+4*$cos2SigmaM*$cos2SigmaM)));

    $s = $b*$A*($sigma-$deltaSigma);
    return $s/1000;

echo VincentyDistance($lat1,$lat2,$lon1,$lon2);
douluo6626 谢谢你的回答马克,我会研究你的指针,并研究它! 干杯
3 年多之前 回复
doushibu2453 如果它是200k条目,那么你使用的是一个尺寸不合适的边界框。 无论如何,使用地理空间感知数据库对性能更好(特别是在2017年); 但是也可以创建汇总表来识别“网格”段中的点数,这可以让您了解您的边界框是否合理; 或替代数据结构,例如四叉树,这对于检索区域内的数据点效率要高得多
3 年多之前 回复
duanjiao6735 我知道这是一个老答案,但是我的边界框包含200k条目? 问题将保持不变。 每个人都建议使用这个解决方案,但是没有人会考虑到边界框实际上可以包含很多结果。
3 年多之前 回复
douping6871 - Epitaph或PHPDna(或他本周使用的任何名称)倾向于对PHPClasses中没有使用他的hilbertcurve解决方案的地理空间问题的任何答案进行投票,并且从未解释它是如何工作的,或者为什么它是一个更好的解决方案。 他确实回答了问题; 当他确实愿意在评论中做出回应时,他知道自己已经被阅读后会有这种令人讨厌的倾向,即再次删除它们
接近 7 年之前 回复
dpzyd8865 你和谁争论? 我没有看到任何“墓志铭”回答你:P。 请注意评论我的答案吗?
接近 7 年之前 回复
duankuai6586 你需要C,因为PHP本身是用C语言编写的。我正在做的是为PHP本身编写新的数据结构,而不是在PHP中编写数据结构
接近 7 年之前 回复
dongtao4890 不,我的意思是PHP的SPL(标准PHP库),特别是数据结构。 我目前正在研究SPLTrie和SPLQuadTree
接近 7 年之前 回复
dql1978dql1978 我理解quadtree是如何工作的,我现在正在为SPL编写一个C实现,它甚至不使用任何维度的数组......只是指向null或者指向NW / SW / NE / SE节点的指针。 树。 搜索速度很快,但填充四叉树的速度较慢...尤其是在使用前需要填充整个四叉树时需要大量数据
接近 7 年之前 回复
douyuanqia665858 我的答案确实回答了OP的问题......它显示了一种减少SQL查询检索的行数的方法,这样可以更快地搜索200k条目而无需将每个条目加载到PHP内存中的四叉树中... 那是问题的有效答案; 虽然它可能不是OP的唯一答案,但事实上是正确的......如果你想要注册PHPClasses就好了(虽然那里有很多编写得很好的库以及好的库),但不是 每个人都希望在新网站上注册以获得他们问题的答案
接近 7 年之前 回复
dove2199 - 我发布它是因为它易于人们理解。 我不试图用四叉树和希尔伯特曲线和摩尔曲线以及他们不理解的其他术语来哄骗人们...我发布它的基本解释如何工作...我发布它因为我可以发布实际 工作代码,而不是将人们指向他们需要注册甚至看到你的代码的网站上的图书馆....我不会系统地低估其他不同意我的人的答案
接近 7 年之前 回复
douping1825 - 也许你自己的答案应该真正解释四叉树是什么以及为什么它们如此之快,甚至可能如何使用它们而不是简单地说“我的库很酷,因为它使用了你无法理解的东西”。 ..也许那时你会获得赞成,并且不会觉得迫切需要通过贬低任何与你不同的答案来主张你的智力优势
接近 7 年之前 回复
dream2891 有人会关心解释最近的一个完美有效的答案,除非它是Phpdna ....在这种情况下,我知道他会拒绝任何人回答地理空间或距离相关的问题,而不是将人们指向他的四核库
接近 7 年之前 回复
douli7841 fyi:PI未定义,我使用了pi()
7 年多之前 回复
douzheng1853 - $ lat1
7 年多之前 回复
douyong4623 哪个lat在第一行? $ lat1或$ lat2?
7 年多之前 回复
duanjieyi6582 - 如果您正在使用带有WHERE子句的边界框,该边界框过滤掉明显超出范围的Lat / Long记录(特别是如果您有lat / long的索引),那么您将根据边界进行选择 在进行任何Haversine / Vincenty /距离计算之前的框...所以你不计算200k记录的距离,只是为了在边界框内的子集。 如果你尝试测试它,它会使它明显加快。
9 年多之前 回复
doubao12345 这个链接非常有用,谢谢。
9 年多之前 回复

What if you approach the problem from a different angle?

10 km in a straight line is:

  1. on the latitude is equal to ~1'(minute)
  2. on the longitude is equal to ~6'(minutes)

Using this as a basis, do some quick math and in your query add to the WHERE clause removing any locations that are outside the 'box' that is created by adding the buffer zone with the assumption of 1' lat & 6' long

gps buffer zone circle

Working from this image:

  1. GPS location you are searching for (34° 12' 34.0", -85° 1' 1.0") [34.2094444444, -85.0169444444]
  2. You find the min/max latitude/longitude

    2a. Min Latitude - 34.1927777778, -85.0169444444

    2b. Min Longitude - 34.2094444444, -85.1169444444

    2c. Max Latitude - 34.2261111111, -85.0169444444

    2d. Max Longitude - 34.2094444444, -84.9169444444

  3. Run your query with the min and max of each direction

    SELECT *
    FROM geoloc
    lat >= 34.1927777 AND
    lat <= 34.2261111 AND
    long >= -85.1169444 AND
    long <= -84.9169444;

You can either integrate the distance calculation with the SQL query or you can use a PHP library/class to run the distance check after pulling the data. Either way you have reduced the number of calculations by a large percentage.

I use the following function to calculate the distance between two US84 GPS locations. Two parameters are passed, each parameter is an array with the first element being the latitude and the second element being the longitude. I believe it has an accuracy to a few feet, which should be enough for all but the hardest core GPS-ophiles. Also, I believe this uses the Haversine distance formula.

$distance = calculateGPSDistance(array(34.32343, -86.342343), array(34.433223, -96.0032344));

function calculateGPSDistance($site1, $site2)
    $distance = 0;
    $earthMeanRadius = 2.0891 * pow(10, 7);

    $deltaLatitude = deg2rad($site2[0] - $site1[0]);
    $deltaLongitude = deg2rad($site2[1] - $site1[1]);
    $a = sin($deltaLatitude / 2) * sin($deltaLatitude / 2) + cos(deg2rad($site1[0])) * 
        cos(deg2rad($site2[0])) * sin($deltaLongitude / 2) * sin($deltaLongitude / 2);
    $c = 2 * atan2(sqrt($a), sqrt(1-$a));
    $distance = $earthMeanRadius * $c;

    return $distance;


I forgot to mention, my distance function will return distance in feet.

douaipi3965 赢得? 赢或输什么? OP只是试图找到一种更好的方式来使他们的网站工作。 有时,计算效率最高的算法并不是最好的。 在生产站点上更新代码时,易用性和集成非常重要
9 年多之前 回复
douxu4610 是的,这应该在欧洲没有问题。 我相信Haversine公式使用了大圆公式,但根据我的经验,它可以在现实世界中提供更准确的距离。 您可以在维基百科上阅读有关Haversine的内容
9 年多之前 回复
dongming5444 这对欧洲有用吗? hasrsine和大圆公式有什么区别?
9 年多之前 回复
douchaqi3369 什么是sfc?
9 年多之前 回复

到目前为止我所做的就像上面描述的@Mark一样。 对于小型网站我认为是一个可行的解决方案,但对我的情况来说并不是那么好(200k条目位于一个给定点周围的100x100平方公里的盒子里面。我使用的是Mark的相同技巧,但性能太差了.5个用户/ 第二次查询附近的lat / lon点几个小时,查询开始需要10到15秒;这发生在之后我调整了my.cnf中的mySQL设置。甚至不想 想想当全世界有200万个参赛作品会发生什么。</ p>

所以,现在是第2步的时间: Hilbert曲线
它应该解决(lat,lon)列上的B树索引问题,这是浪费的(onrange扫描,在B树索引的一部分) 通过在一列上使用一个索引(hilbert_number)来使用).hilbert_number是基于Hilbert曲线上的点的纬度/经度坐标计算的数字。</ p>

但是第二个问题, 通过Haversine公式测试固定点与之前结果子集之间的距离。 那部分可能很慢。 因此,我正在考虑以某种方式更直接地测试距离,将所有内容放在希尔伯特曲线上,并将一些位掩码应用于该结果子集,而不是应用Haversine公式。 我只是不知道我该怎么做... </ p>

无论如何,我用来减少结果子集中点数的另一个技巧是使用两个边界框和 仅在子集中包含灰色/白色点以进行进一步的Haversine测试:</ p>

</ p>

我现在需要做的是切换到希尔伯特数并看看它的行为方式。 但我怀疑这会增加10倍的性能!</ p>
</ div>



What I was doing till now is just as @Mark described above. A viable solution for small sites I guess, only not that good for my case (200k entries localized inside some 100x100 square km box centered around a given point. I was using this same trick of Mark's but performance is just too poor. 5 users/second querying for nearby lat/lon points for few hours and the queries start taking up to 10 - 15 seconds; and this happens after I have adjusted mySQL settings in my.cnf. Don't even want to think about what would happen when there will be 2 million entries worldwide.

So, now time for step 2: Hilbert curve. It should solve the problem of B-tree index on (lat, lon) columns which is wasteful (onrange scans, ony one part of the B-tree index is being used) by employing just one index on one column (hilbert_number). hilbert_number being a number calculated based on a point's lat/lon coordinates on the Hilbert curve.

But the second problem, of testing the distance between fixed point and everything from the previous result subset through the Haversine formula remains. That part can be very slow. So I was thinking about somehow testing for distance more directly, putting everything on the hilbert curve and applying some bitmask to that result subset instead of applying the Haversine formula. I just don't know how would I go about that...

Anyway, another trick I have employed to reduce the number of points in the result subset was to use two bounding boxes and include in the subset only the gray / white points for further Haversine testing:

inner and outer BB

What I need to do right now is switch to Hilbert numbers and see how it behaves. But I doubt this is going to increase 10x the performance!

dragonhong641016 嗯,所以我想Java SE中的四叉树速度应该很好吗? 你能指点我一个关于如何使用四叉树进行伪圆(而不是伪BB)搜索的好教程吗? 最好适应希尔伯特曲线; 最好也已经用Java实现了:D
接近 7 年之前 回复
douji6199 在某种程度上,这一切都取决于您的查询试图回答的问题。 如果你想要在中心点的某个半径范围内的所有点的列表,你甚至不需要计算内部边界框内的点的实际距离(只是框本身的角落),仅适用于那些点 位于内盒和外盒之间,看是否应该包括或丢弃它们; 但如果您想按距离订购列表,则需要计算所有可能的条目
接近 7 年之前 回复
dongxiezhuo8852 对于大数据量,我一直在尝试使用四叉树,并且发现它们的搜索速度非常快,但加载速度非常慢(而且对于较大的数据集而言,内存耗费很大)。 这就是为什么我一直在为PHP编写扩展来实现四叉树作为C数据结构而不是PHP数据结构...并且在纯PHP中实现相同的搜索和初始加载的性能要高得多... 但是如果我能将它序列化以便它可以保存在APC,memcached,redis或类似物中而不是必须在每次请求时从db重建
接近 7 年之前 回复
dongqing5925 对于大数据量,我一直在尝试使用四叉树,并且发现它们的搜索速度非常快,但加载速度非常慢(而且对于较大的数据集而言,内存耗费很大)。 这就是为什么我一直在编写PHP的扩展来实现四叉树作为C数据结构而不是PHP数据结构...并且在纯PHP中实现相同的搜索和初始加载的性能要高得多
接近 7 年之前 回复
dongmi5020 我自己使用了类似的边界框方法调整,从一个小框查询开始,然后向外增加但调整查询以便以前的结果不会再次返回,多个查询,但仍然相当选择性... 正如您所说,它对于小到中等数据量相当有效,并且比单个边界框更有效。 我更喜欢在内置数据库地理空间扩展时使用它们; 但这并不总是一种选择,并且边界框方法仍然可以很好地运行。
接近 7 年之前 回复

您可以尝试使用四核。 这是一个空间索引,减少了维度。 它将地图细分为图块,但您可以使用它来存储点。 你可以下载我的php类hilbert-curve @。 它还包括z曲线和摩尔曲线。 重要的是知道它使用墨卡托投影。 您可以查找Bing地图平铺。 它解释了如何使用四核。 您需要x,y坐标和z(缩放或深度)值。 然后它会给你一个四元组。</ p>
</ div>



You could try a quadkey. It's a spatial index and reduce the dimension. It subdivide a map into tiles but you can use it to store points. You can download my php class hilbert-curve @ It also includes a z-curve and a moore-curve. Important is to know it uses a mercator projection. You can look for Bing maps tiling. It explains how to use a quadkey. You need x,y coordinate and z (zoom or depth) value. Then it gives you a quadkey.

donglin317704291 这个答案不仅用不好的英语写成,它实际上并没有以任何有意义的方式回答这个问题。
7 年多之前 回复
dongtan2603 我看了你的PHP代码,但我不明白。 你有一个如何使用它的例子吗?
9 年多之前 回复
Csdn user default icon