MasterofProject

Distributed computing /Hadoop

Kissmelove01Kissmelove0111:57 02-26
Grade T1 69 reply

Ctrip interview a MapReduce topic.

Existing 1 million hotel coordinates and 2 billion landmark, which record the latitude and longitude of the landmark, please design the MapReduce calculation of all hotels within 1 km range of the landmark.

TntzbzcTntzbzc12:43 02-26
Grade T1 The 1 floor

[b] good title!!!!!!!!! [/b] Go home at night to do ~~~~~~[img=http://prog3.com/sbdm/forum/PointForum/ui/scripts/csdn/Plugin/001/face/13.gif][/img]

Kissmelove01Kissmelove0113:26 02-26
Grade T1 The 2 floor

At that time the interviewer is to say that the hotel coordinates and landmark coordinates are in a document, assuming that the field is Type (hotel / landmark) ID (Hotel id/ landmark ID) latitude longitude Here to solve the problem is how to design map in the same document with a hotel within 1 km of the landmark to find out, to the hotel ID for the key, the landmark ID for the value to reduce to deal with.

TntzbzcTntzbzc16:17 02-26
Grade T1 The 3 floor

[b] version of the discussion on the algorithm is really less, after all the algorithms to add 100 points [/b]

TntzbzcTntzbzc11:27 02-27
Grade T1 The 4 floor

I give up the latitude longitude, directly using a European two-dimensional plan to solve this problem Change the coordinates into X and Y, the smallest unit of rice, there is no decimal point There are two ways, but they all need 2 MR to solve 1, violence algorithm: The first MR: in the MAP phase of the poor to cite all the coordinates of the 1 million hotels (unit meters), so the hotel data will be redundant 1000000* (999*4+1000+2) = 4998000000 = 4 billion 900 million hotel data Format is Key = Y, Value X = hotel / landmark ID, type (hotel / landmark) Put the aggregated Count>1 KEY data into the HDFS - > tmpfile The output format is the number of hotel ID Second MR: read the tmpfile according to the hotel ID do a polymerization addition to get the final results This algorithm is the biggest defect hotel is exhaustive too much, causing MAP to Reduce output data is too large So I think second ways, on the basis of the first algorithm to achieve dimension reduction 1, dimensionality reduction algorithm: First MR: 1, according to the coordinates of each hotel, find 4 new coordinates (X/Y radius according to the method, see blue line +) The blue line rotation 45 degrees to get a new 4 coordinates, the red line The hotel and the landmark of the X/Y divided by the radius (1000) to get Xa/Ya, falling 8 points connected to the gray line within the hotel coordinates Xa/Ya is that we have to statistics out of the data [img=http://prog3.com/sbdm/img.bbs/upload/201402/27/1393471636_621693.jpg][/img] [b] but there is a problem, by 8 points formed by the 8 edge type does not cover the entire circle of all the landmark data [/b] The solution is to hotel coverage as a square (not round), edge width = *2. Cut the square into 4 small squares, each square, representing the area that a hotel can cover. This will appear 4*4=16 coordinates, remove the repeated coordinates of the final 9, so that you can guarantee to cover all of the hotel can be covered by the landmark. Which also includes a number of landmarks outside the radius, where the first to handle him, so reduce to deal with. [img=http://prog3.com/sbdm/img.bbs/upload/201402/27/1393475240_984482.jpg][/img] So each hotel only need 8+1 (don't forget the redundant data center), this Map Hotel output = 1000000*9 = 9 million Format is Key = Ya, Value = hotel / landmark ID, type (hotel / landmark), X, Y, Xa Reduce and before the same but need to do more than one step is to judge whether the distance between the hotel and the landmark in the radius Put the aggregated Count>1 KEY data into the HDFS - > tmpfile Second MR: the same as the first method, read the tmpfile according to the hotel ID do a polymerization addition to get the final results Code, and so I have to write out of space, LZ if you can try to write a free

SanguomiSanguomi21:39 02-27
Grade T1 The 5 floor

Can use on the sampling algorithm, and then block Otherwise the efficiency of reduce is too low

TntzbzcTntzbzc22:28 02-27
Grade T1 The 6 floor

[quote= quoted 5 floor sanguomi reply: " Can use on the sampling algorithm, and then block Otherwise the efficiency of reduce is too low [/quote] can give details about your ideas

SanguomiSanguomi09:56 02-28
Grade T1 The 7 floor

My initial idea is that if the 2 billion data and 2 million data, if not done in the region, The least complexity is O (2000000 * 2000000000). And reduce can only have one, which is certainly slow to deal with If it's what I do, 2 billion data, The first step MR sampling calculation according to the coordinates of X, or y to 100 or 1000 Get a relatively average of 100 x or Y coordinate values Second step MR take these points of the coordinates, the 2 billion coordinates and 200W coordinates are also divided into the corresponding 100 or 1000 Here we have to consider a boundary issue, taking into account the limits of the value of some hotels in the coordinates of the point of our point X/Y value So in the coordinates of the partitions of the data, the first block of the actual area is divided to x plus one kilometer, likewise the second quickest region cut down a km (calculated in the map, the two regions to the transmission of a data), hotel zone directly according to the sampling point to can, output key for value corresponding to the sampling point X or Y, data coordinate The third step Mr marker for two file Hotel tag 0 coordinate data of the tag is a combination of sampling points as the key, then reduce end for aggregation operations, coordinates hotel first, stored in containers, later the coordinates of each data are calculated and containers to preserve the hotel coordinate distance calculation, meet the output. If it is 1000 parts, then the basic is O (2000000 * 2000 * 1000) Fourth step MR merger It's not the point to say that it's still wet.

Java8964Java896410:31 02-28
Grade T1 The 8 floor

Is an interested question. This Here is my thoughts. 1) only need one MR job. We 2) the focus is how to design a custom key class, which will sort the data as the way we want. Remember, the MapReduce job, almost every key will be compared with each other to grouping/partition and sorted. this is the key feature provided from Hadoop, and we should utilize. Input of mapper will be (LongWritable, Text), default TextInputFormat as, the output of the mapper will be <KeyClass NullWriteable (). The but (). Most important is to The design this custom KeyClass. Are some important information Here related this key class I, will use pseudo code: Key Class { Type enum Double altitude Double latitude } This mapper In we have to have the custom SortComparator GroupComparator and PartitionComparator. Here are the trick parts of each:. 1) sort comparator: For Same type compare, altitude and latitude. For Different type For make sure all is less than s that will sort first then. 2) grouping and Partition comparator this, is very important part. For A Make sure each will be different key means they are different. So same will go to same reducer. B) for landmark data. If the compared object is another landmark, just compare their altitude and latitude, so different landmarks will omit as different, but if the compared object is hotel, and if 2 pair of altitude and latitude) are within 1 km, and mark these two objects are the same. So this landmark data will treat as the same grouping as the hotel data, and both will be treat as the same key, and sent to the same reducer. All this will just need one Mr job, this interview question is really to check if you are design a custom key, and override the grouping and partition comparator (make sure that if landmark data is within 1 km as hotel data, they will be changing and reside together, so they will go to the same reducer). So after mapper stage, you will have 100 million reducer group key, as you have 100 million hotel data, but all the landmark data within 1 km of the hotel will be treated as the same key of that hotel, then being sent to the same reducer. Hope I explain clearly in I English as Chinese typing is too slow for me.

TntzbzcTntzbzc11:02 02-28
Grade T1 The 9 floor

[quote= quoted 7 floor sanguomi reply: " My initial idea is that if the 2 billion data and 2 million data, if not done in the region, The least complexity is O (2000000 * 2000000000). And reduce can only have one, which is certainly slow to deal with If it's what I do, 2 billion data, The first step MR sampling calculation according to the coordinates of X, or y to 100 or 1000 Get a relatively average of 100 x or Y coordinate values Second step MR take these points of the coordinates, the 2 billion coordinates and 200W coordinates are also divided into the corresponding 100 or 1000 Here we have to consider a boundary issue, taking into account the limits of the value of some hotels in the coordinates of the point of our point X/Y value So in the coordinates of the partitions of the data, the first block of the actual area is divided to x plus one kilometer, likewise the second quickest region cut down a km (calculated in the map, the two regions to the transmission of a data), hotel zone directly according to the sampling point to can, output key for value corresponding to the sampling point X or Y, data coordinate The third step Mr marker for two file Hotel tag 0 coordinate data of the tag is a combination of sampling points as the key, then reduce end for aggregation operations, coordinates hotel first, stored in containers, later the coordinates of each data are calculated and containers to preserve the hotel coordinate distance calculation, meet the output. If it is 1000 parts, then the basic is O (2000000 * 2000 * 1000) Fourth step MR merger It's not the point to say that it's still wet. [/quote] Thank you so much for your sharing. 7 and 8 floor plan and I have a common point is that the hotel / landmark Y and [b] X divided by a coefficient [/b] (coefficient = radius 1000), this will be able to achieve dimensionality reduction But I don't quite understand after dimension reduction, dimension, you get the algorithm complexity is O (2000000 * 2000 * 1000) According to the second methods I mentioned before Landmark input map = 2000000000 Landmark in map dimension after input reduce = 2000000000/1000 = 2000000 Hotel input map = 1000000 * 9 Hotel Map reduce input = (1000000/1000) =1000*9 *9 Finally such a result is not should be more reasonable 2000000 * 1000* 9

SanguomiSanguomi11:30 02-28
Grade T1 The 10 floor

Back to the big wet You do this, is not good to do aggregation operation, such as that coordinate data no you hotel center coordinates to coordinates that, migration of a little bit, in addition to a magnitude of 8 regional processing is not simple.

TntzbzcTntzbzc11:31 02-28
Grade T1 The 11 floor

[quote= quoted 8 floor java8964 reply: " Is an interested question. This Here is my thoughts. 1) only need one MR job. We 2) the focus is how to design a custom key class, which will sort the data as the way we want. Remember, the MapReduce job, almost every key will be compared with each other to grouping/partition and sorted. this is the key feature provided from Hadoop, and we should utilize. Input of mapper will be (LongWritable, Text), default TextInputFormat as, the output of the mapper will be <KeyClass NullWriteable (). The but (). Most important is to The design this custom KeyClass. Are some important information Here related this key class I, will use pseudo code: Key Class { Type enum Double altitude Double latitude } This mapper In we have to have the custom SortComparator GroupComparator and PartitionComparator. Here are the trick parts of each:. 1) sort comparator: For Same type compare, altitude and latitude. For Different type For make sure all is less than s that will sort first then. 2) grouping and Partition comparator this, is very important part. For A Make sure each will be different key means they are different. So same will go to same reducer. B) for landmark data. If the compared object is another landmark, just compare their altitude and latitude, so different landmarks will omit as different, but if the compared object is hotel, and if 2 pair of altitude and latitude) are within 1 km, and mark these two objects are the same. So this landmark data will treat as the same grouping as the hotel data, and both will be treat as the same key, and sent to the same reducer. All this will just need one Mr job, this interview question is really to check if you are design a custom key, and override the grouping and partition comparator (make sure that if landmark data is within 1 km as hotel data, they will be changing and reside together, so they will go to the same reducer). So after mapper stage, you will have 100 million reducer group key, as you have 100 million hotel data, but all the landmark data within 1 km of the hotel will be treated as the same key of that hotel, then being sent to the same reducer. 我希望我能用英文解释清楚,因为我的打字速度太慢了。 我认为这将是一个完美的解决方案 但我不知道如何使用一个[或]先生的工作 当最后一个输出格式减少: 酒店99地标31 酒店2地标1034 酒店1地标41 酒店322地标9 酒店12地标199 ...... ...... 酒店98地标199122 酒店998地标122 酒店3地标223 最后的输出是不按酒店排序的,我认为一个先生的工作是可以的 但如果最后的输出需要对酒店数据进行排序,我认为这需要2份工作 酒店1地标1 酒店1地标4 酒店1地标12 酒店1地标31 酒店2地标455 酒店2地标1525 ..... ..... 酒店N-1地标1211 酒店N地标242323 酒店N地标4563432 【B】B)为地标数据,如果比较的对象是另一个地标,只是比较它们的海拔高度和纬度,所以不同的地标将省略不同,但如果比较的对象是酒店,如果2对(纬度)在1公里,标志这2个对象都是相同的。所以这地标数据将视为同一分组的酒店数据,都将被视为相同的密钥,并发送到同一个减速机。[/b] 如果我的想法是错误的,这个解释是关键 你想展示你的这种解释的代码吗 谢谢

Tntzbzctntzbzc02-28 11:36
等级 T1 12楼

[报价=引用10楼sanguomi的回复:] 回大湿 你这样做,不好做聚合操作吧,比如那种坐标数据没有你酒店圆心坐标来作圆心坐标的那种,偏移了一点点,另外还有8个幅度的区域处理也不简单啊[引用] 可以做的,那8个增点,一定覆盖所有圆内地标,而且只会多覆盖,不会少覆盖 被多覆盖的地标,需要在减少阶段剔除 你看我这张图 【img = HTTP:/ / prog3。COM /深圳/ IMG。论坛/上传/ 201402 / 28 / 1393558537_656758 .jpg ] [图片] 正方形的完全覆盖了原,而且那四个角还多出来一些面积

Sbwwkmydsbwwkmyd02-28 12:48
等级 T1 13楼

基于与KD树不行吗?

Nettmannettman02-28 13:22
等级 T1 14楼

【img = HTTP:/ / prog3。COM /深圳/论坛/ pointforum /用户界面/脚本/编程/插件/ 003 /猴子/ 1。GIF ] [下]

U013819169u01381916902-28 13:31
等级 T1 15楼

[网址= HTTP:/ / prog3。COM /深圳/ jcloud /米/作/作/ product_type = 10和project_id = 612和opus_zone?ID = 1759 ] [ IMG = HTTP:/ / prog3。COM /深圳/论坛/ pointforum /用户界面/脚本/编程/插件/ 001 /面/ 79。GIF ] [下] [网址]

U013819696u01381969602-28 13:35
等级 T1 16楼

打不开点百度快照聘ATM机取钱人员替人取钱企鹅2264401345高提成日结 工作就是到ATM机取钱,提成10%,当场支付会用ATM机取钱即可 要求:人品好,人要诚实,讲信用,保密意识强,不符合条件者勿扰

Nadleeh123nadleeh12302-28 13时54分
等级 T1 17楼

[报价=引用楼主kissmelove01的回复:] 现有100万酒店坐标和20亿地标,里面记录地标的经纬度,请设计MapReduce计算所有酒店1公里范围内的地标[引用]。 感觉很模糊,没给我文件格式,难道我还去设计个格式?要有高效率必须建立在有规律的数据上。 如果按通常格式来说,可以借鉴下grep的实现方式[ IMG = HTTP:/ / prog3。COM /深圳/论坛/ pointforum /用户界面/脚本/编程/插件/ 003 /洋葱/ 2。GIF ] [下]

Vbplayboyvbplayboy02-28 15:07
等级 T1 18楼

记录你浏览网页喜欢的图片和视频等pyoulike.com

CodeMadmancodemadman02-28 15:43
等级 T1 19楼

6个字符】【凑满函数式?

Lkf181lkf18102-28 23:08
等级 T1 20楼

高手们啊~~~~~···

U013843287u01384328703-01 17:33
等级 T1 21楼

高手们啊~~~~~···

Zanfeng03-01 18:41
等级 T1 22楼

按经纬度分块存酒店,地点坐标时就存一个块里。 取的时候只要从块里面取就可以了。 循环判断一个块里是不是距离一里..

Lwwenlonglwwenlong03-02 20:29
等级 T1 23楼

[报价=引用11楼tntzbzc的回复:] [报价=引用8楼java8964的回复:] 这是一个感兴趣的问题。这是我的想法。 1)我们只需要一份工作。 2)重点是如何设计一个自定义键,将数据作为我们想要的方式排序。记住,在MapReduce的工作,几乎每一个关键会互相比较,分组/分区和排序。这是从Hadoop的重要特征,我们应该利用。 映射器的输入将(longwritable,文本),默认textinputformat,但成像仪的输出将<<关键类,nullwriteable)。 最重要的是设计这一习俗关键类。 这里有一些重要的信息相关的这个关键类,我会用伪代码: 类键{ 枚举类型 高度加倍 纬度双 } 在这个映射,我们必须习惯sortcomparator,groupcomparator和partitioncomparator。这里是每个部分的伎俩: 1)排序比较器: 同类型,比较高度和纬度。 对于不同的类型,确保所有酒店小于地标,将酒店排序第一,然后地标。 2)用于分组和分区比较,这是非常重要的部分。 a)确保每个酒店将不同的关键,意味着他们是不同的。所以同样的酒店会去同样的减速器。 B)为地标数据,如果比较的对象是另一个地标,只是比较它们的海拔高度和纬度,所以不同的地标将省略不同,但如果比较的对象是酒店,如果2对(纬度)在1公里,标志这2个对象都是相同的。所以这地标数据将视为同一分组的酒店数据,都将被视为相同的密钥,并发送到相同的减速器。 所有这一切都只需要一个乔布斯,这个面试问题真是检查如果你设计一个自定义键,并重写分组和分区比较器(确保如果地标数据是在1公里作为酒店数据,他们将分组分在一起,所以他们会去相同的减速器)。所以在映射阶段,你将有100万减速器组密钥,当你有100万酒店数据,但所有的地标数据在1公里的酒店将被视为是酒店相同的密钥,然后被发送到相同的减速器。 我希望我能用英文解释清楚,因为我的打字速度太慢了。 我认为这将是一个完美的解决方案 但我不知道如何使用一个[或]先生的工作 当最后一个输出格式减少: 酒店99地标31 酒店2地标1034 酒店1地标41 酒店322地标9 酒店12地标199 ...... ...... 酒店98地标199122 酒店998地标122 酒店3地标223 最后的输出是不按酒店排序的,我认为一个先生的工作是可以的 但如果最后的输出需要对酒店数据进行排序,我认为这需要2份工作 酒店1地标1 酒店1地标4 酒店1地标12 酒店1地标31 酒店2地标455 酒店2地标1525 ..... ..... 酒店N-1地标1211 酒店N地标242323 酒店N地标4563432 【B】B)为地标数据,如果比较的对象是另一个地标,只是比较它们的海拔高度和纬度,所以不同的地标将省略不同,但如果比较的对象是酒店,如果2对(纬度)在1公里,标志这2个对象都是相同的。所以这地标数据将视为同一分组的酒店数据,都将被视为相同的密钥,并发送到同一个减速机。[/b] 如果我的想法是错误的,这个解释是关键 你想展示你的这种解释的代码吗 谢谢[ /报价] 怎么设计分割类呢????

U011702993u01170299303-03 10:51
等级 T1 24楼

【img = HTTP:/ / prog3。COM /深圳/论坛/ pointforum /用户界面/脚本/编程/插件/ 003 /猴子/ 25。GIF ] [下]

U013061236u01306123603-03 11:43
等级 T1 25楼

【img = HTTP:/ / prog3。COM /深圳/论坛/ pointforum /用户界面/脚本/编程/插件/ 003 /猴子/ 34。GIF ] [下] 我是来学习的

Chouychouy03-03 14:24
等级 T1 26楼

不是有Geohash么

Ftjavaypftjavayp03-03 15:11
等级 T1 27楼

【img = HTTP:/ / prog3。COM /深圳/论坛/ pointforum /用户界面/脚本/编程/插件/ 003 /猴子/ 9。GIF ] [下]

Tntzbzctntzbzc03-03 20:46
等级 T1 28楼

[报价=引用23楼lwwenlong的回复:] 怎么设计分割类呢????[引用] 帖子里说不清楚,我贴代码了 [报价=引用8楼java8964的回复:] 我希望我能用英文解释清楚,因为我的打字速度太慢了。 把你、7楼和我的方案结合了一下,上代码 写的比较乱,随便看看吧 创建测试数据 [代码] 进口java.io.outputstream; 进口java.util.random; 进口org.apache.hadoop.conf.configuration; 进口org.apache.hadoop.fs.filesystem; 进口org.apache.hadoop.fs.path; 公共课ctrip001_createdata { public static void main(String [] args){抛出异常 100 / 20现有万酒店坐标和亿地标,里面记录地标的经纬度,请设计MapReduce计算所有酒店1公里范围内的地标。 / /类型(酒店/地标)ID(酒店ID /地标ID)纬度经度 1600 /假设一个平方公里的小城市= 40 * 40 = 40000公里公里米×40000米 1000 /假设有个酒店,1000000个地标 int len = 40000; 国际酒店= 1000; int LM = 1000; 随机跑=新(); int类型= 0;// 0酒店1地标 字符串hdfs_path =“/tmp /酒店”; 配置hadoopconf =新configuration(); hadoopconf。集(“FS默认名称”、“HDFS:/ / M04。CT1。R01。HDP:9000”); 梅雨锋系=文件系统文件系统得到(hadoopconf); OutputStream =梅雨锋系。创造(新路径(hdfs_path)); 为(int i = 0;i <酒店;i++){ //用两个为循环嵌套,打乱酒店/地标数据 / /类型(酒店/地标)ID(酒店ID /地标ID)纬度经度 int = nextInt(LEN)跑; int HY =跑。nextInt(LEN); 类型= 0; StringBuilder STR =新stringbuilder(); str.append(型);/ /类型(酒店/地标) str.append(“T”); str.append(我10000);// id(酒店ID /地标ID) str.append(“T”); str.append(HX);/ /酒店X str.append(“T”); str.append(HY);/ /酒店Y str.append(“\r\n”); 类型= 1; 为(int k = 0;K<LM;K+){ int LMX =跑。nextInt(LEN); int l我=跑。nextInt(LEN); str.append(型);/ /类型(酒店/地标) str.append(“T”); str.append(我* 10000 + K + 1);// id(酒店ID /地标ID) str.append(“T”); str.append(LMX);/ /地标X str.append(“T”); str.append(LMY);/ /地标Y str.append(“\r\n”); } 字节[] midbytes1 =(结构tostring())。getBytes(“utf8”); (midbytes1)写出来; 系统,println(我1000); } 出来。(); } } [ /代码] 1公里内的所有地标MapReduce找出所有酒店 [代码] 进口java.io.datainput; 进口java.io.dataoutput; 进口; 进口java.util.arraylist; 进口org.apache.hadoop.conf.configuration; 进口org.apache.hadoop.fs.filesystem; 进口org.apache.hadoop.fs.path; 进口org.apache.hadoop.io.doublewritable; 进口org.apache.hadoop.io.intwritable; 进口org.apache.hadoop.io.longwritable; 进口org.apache.hadoop.io.nullwritable; 进口org.apache.hadoop.io.text; 进口org.apache.hadoop.io.writablecomparable; 进口org.apache.hadoop.io.writablecomparator; 进口org.apache.hadoop.mapreduce.job; 进口org.apache.hadoop.mapreduce.mapper; 进口org.apache.hadoop.mapreduce.partitioner; 进口org.apache.hadoop.mapreduce.reducer; 进口org.apache.hadoop.mapreduce.lib.input.fileinputformat; 进口org.apache.hadoop.mapreduce.lib.output.fileoutputformat; 公共课ctrip001 { public static void main(String [] args){抛出异常 字符串=“/tmp /酒店”; 字符串=“/甲氧苄啶/ hotel001”; 配置conf =新configuration(); conf.set(“mapred。工作。跟踪器”、“M04。CT1。R01。HDP:9001”); 工作=新工作(机密,“酒店”); 工作。setjarbyclass(ctrip001。类); 文件系统FS =文件系统。得到(机密); 删除(新路径(出),真); fileinputformat。addinputpath(工作、新路径(中)); fileoutputformat。setoutputpath(工作、新路径(出)); 工作。setmapperclass(hotelmap。类); 工作。setmapoutputkeyclass(酒店类); 工作。setmapoutputvalueclass(nullwritable。类); 工作。setreducerclass(hotelreducer。类); 工作。setoutputkeyclass(文字类); 工作。setoutputvalueclass(nullwritable。类); Job.setNumReduceTasks (5); / / according to the hotel landmark data lines, or reduce. In order to ensure the speed, the about 2000000000 line data is about 500 to 1000 Reduce (according to different models to change the configuration) Job.setGroupingComparatorClass (HotelGrouping.class); Job.setPartitionerClass (HotelPartitioner.class); Job.waitForCompletion (true); } Static class HotelMap extends Mapper<LongWritable Text, public, Hotel, NullWritable> { Void map public (key LongWritable, value Text, context Context) java.io.IOException throws, InterruptedException { Hotel Hotel = Hotel new (); Hotel.setHotel (value); Context.write (Hotel, NullWritable.get) (); / / or Landmark Hotel (if) (hotel.getType = = 0) {/ / redundant 8 centers for the hotel Hotel.setXYa (1, 0); Context.write (Hotel, NullWritable.get ()); Hotel.setXYa (1, 1); Context.write (Hotel, NullWritable.get ()); Hotel.setXYa (1, -1); Context.write (Hotel, NullWritable.get ()); Hotel.setXYa (-1, 0); Context.write (Hotel, NullWritable.get ()); Hotel.setXYa (-1, 1); Context.write (Hotel, NullWritable.get ()); Hotel.setXYa (-1, -1); Context.write (Hotel, NullWritable.get ()); Hotel.setXYa (0, 1); Context.write (Hotel, NullWritable.get ()); Hotel.setXYa (0, -1); Context.write (Hotel, NullWritable.get ()); } Return; } } Static class HotelReducer extends Reducer<Hotel NullWritable, public, Text, NullWritable> { @SuppressWarnings ("unused") Void reduce public (key Hotel, value Iterable<NullWritable>, context Context) IOException throws, InterruptedException { Count int = 0; ArrayList<Hotel> MyHotel = new ArrayList<Hotel> (s); / / a group may have multiple Hotel and ArrayList as hotel container For (x NullWritable: value) { If (count = = 0 & & key.getType) = 0 (!) Break; (if) (key.getType = = 0) { {try MyHotel.add ((Hotel) (key.clone) ()); } Catch (CloneNotSupportedException E) {} {else} { For (H Hotel: MyHotel) { Dis double = Math.sqrt (Math.pow (h.getX) - key.getX (), 2) + Math.pow (h.getY () - key.getY (), 2); If (DIS < = 1000) { Output format: ID / ID distance (hotel landmark test). Abandon landmark data without Hotel relationship Context.write (Text (New) + "\t" + "key.getID" + "\t" + "h.getID"" + (int) dis (), NullWritable.get ()); } } } ++count; } } } } The hotel / landmark data re divided into different reduce HotelPartitioner extends Partitioner<Hotel NullWritable>, class { Int getPartition public (key Hotel, value NullWritable, numPartitions int) { Math.abs ((int) (return) (key.getXa () () + key.getYa ()% numPartitions); }; } / / custom Group according to the 1000*1000 block reshuffle polymerization HotelGrouping extends WritableComparator class { HotelGrouping protected () { Super (Hotel.class, true); } @SuppressWarnings ("rawtypes") @Override / / WritableComparables. Int compare public (W1 WritableComparable, W2 WritableComparable) { H1 Hotel = (Hotel) w1; H2 Hotel = (Hotel) w2; R1x int = h1.getXa (); R1y int = h1.getYa (); R2x int = h2.getXa (); R2y int = h2.getYa (); R int = 0; If (r1x = = r2x) { If (r1y = = r2y) { R = 0; {else} { If (r1y > r2y) { R = 1; } {else R = -1; } } {else} { If (r1x > r2x) { R = 1; } {else R = -1; } } R return; } } Hotel implements WritableComparable<Hotel> Cloneable, class { IntWritable Type private; IntWritable ID private; IntWritable X private; IntWritable Y private; IntWritable Xa private; IntWritable Ya private; DoubleWritable Distance private; { Type = IntWritable new (); ID = IntWritable new (); X = IntWritable new (); Y = IntWritable new (); Xa = IntWritable new (); Ya = IntWritable new (); Distance = DoubleWritable new (); } Int getType public () { Type.get return (); } Int getID public () { ID.get return (); } Int getX public () { X.get return (); } Int getY public () { Y.get return (); } Int getXa public () { 返回get() XA; } 公共getya() { int get()回报你; } 公共无效setxya(int XD,int YD){ 本。XA。集(X get() / 1000 + XD); 本。雅。集(Y get() / 1000 + YD); } 公众的双重getdistance() { get()返回的距离; } 公共无效sethotel(文本STR){ 字符串[] V =结构tostring()。分裂(“T”); 型。集(整数。parseInt(V [ 0 ])); id.set(整数。parseInt(V [ 1 ])); x.set(整数。parseInt(V [ 2 ])); y.set(整数。parseInt(V [ 3 ])); XA。集(整数。parseInt(V [ 2 ])/ 1000); 你集(整数。parseInt(V [ 3 ])/ 1000); 距离。集(数学。sqrt(数学。战俘(getx(),2)+数学。战俘(getx(),2))); } “重写” 保护对象()抛出clonenotsupportedexception { / /待办事项自动生成方法存根 回超(); } “重写” public void ReadFields(输入的)抛出IOException { 类型=新intwritable(在。readint()); ID =新intwritable(在。readint()); X =新intwritable(在。readint()); Y =新intwritable(在。readint()); xa =新intwritable(在。readint()); 雅=新intwritable(在。readint()); 距离=新doublewritable(在。readdouble()); } “重写” public void写(数据输出)抛出IOException { 出来。writeint(类型。get()); 从writeint(ID get()); 从writeint(X get()); 从writeint(Y get()); 出来。writeint(XA。get()); 出来。writeint(雅。get()); 出来。writedouble(距离。get()); } “重写” public int compareTo(酒店){ int r = 0; 如果(这。getxa() = = H getxa()){ 如果(这。getya() = = H getya()){ R =这。gettype() -- gettype();/ /实现洗牌阶段二次排序,把酒店放在每个组的最前面 { } 如果(这。getya() > H. getya()){ = 1; } 否则{ = - 1; } } { } 如果(这。getxa() > H. getxa()){ = 1; } 否则{ = - 1; } } 返回; } “重写” 公共字符串tostring() { 返回getxa() +“T”+ getya() +“T”+ gettype() +“T”+ getid() +“T”+ getx() +“T”+ gety(); } } [ /代码]

Sanguomisanguomi03-03 23:03
等级 T1 29楼

大湿的代码应该没什么问题了 大概是使对应的酒店的地标数据8个方位各自产生一条数据,作用是利用这8条数据分区到对应的地标数据中,使每个酒店都能匹配到一个正方形内的地标数据。 不知道描述的是否正确..

Sanguomisanguomi03-03 23:10
等级 T1 30楼

其实是原来的1份酒店地标变成9份,然后和所有地标数据做加入

Tntzbzctntzbzc03-04 10:26
等级 T1 31楼

[报价=引用29楼sanguomi的回复:] 大湿的代码应该没什么问题了 大概是使对应的酒店的地标数据8个方位各自产生一条数据,作用是利用这8条数据分区到对应的地标数据中,使每个酒店都能匹配到一个正方形内的地标数据。 不知道描述的是否正确.. [引用] [报价=引用30楼sanguomi的回复:] 其实是原来的1份酒店地标变成9份,然后和所有地标数据做[引用]加入 对的

Sanguomisanguomi03-04 10:30
等级 T1 32楼

刚刚测试发现了问题。 [代码] 为(文本×:值){ 如果(计数= = 0和关键。gettype()!= 0){ 系统,println(“无”); 标签=真; 系统的输入(字符串值(关键。getx())+“”+字符串值(关键。gety())); 系统,println(上下文); / /休息; } [ /代码] 拿着这些打印出来的坐标数据 在地图中直接和每个酒店数据计算,发现有符合1公里的需求。 [代码] DIS =数学。sqrt(数学。战俘(酒店。getx() - 17704,2)+ 数学。战俘(酒店。gety() - 25787,2)); 如果(解散< = 1000){ 系统,println(“******”); } [ /代码] 看来应该是分组迭代的时候有问题,

Tntzbzctntzbzc03-04 10:44
等级 T1 33楼

[报价=引用32楼sanguomi的回复:] 刚刚测试发现了问题。 [代码] 为(文本×:值){ 如果(计数= = 0和关键。gettype()!= 0){ 系统,println(“无”); 标签=真; 系统的输入(字符串值(关键。getx())+“”+字符串值(关键。gety())); 系统,println(上下文); //break; } [/code] Take these print out of the coordinate data Directly in the map and the calculation of each hotel data, found that there is in line with the needs of 1 km. [code=java] Dis = Math.sqrt (Math.pow (hotel.getX) - 17704, 2) + Math.pow (hotel.getY () - 25787, 2); If (DIS < = 1000) { System.out.println ("*"); } [/code] It seems that there should be a problem when grouping iteration, [/quote] Read the letter

SanguomiSanguomi12:31 03-04
Grade T1 The 34 floor

[quote= quoted 33 floor tntzbzc reply: " [quote= quoted 32 floor sanguomi reply: " Has just been tested and found the problem. [code=java] For (x Text: value) { If (count = = 0 & & key.getType (0) = {!) System.out.println ("NO"); Tag = true; System.out.println (String.valueOf (key.getX) + "" + String.valueOf (key.getY ()); System.out.println (context); //break; } [/code] Take these print out of the coordinate data Directly in the map and the calculation of each hotel data, found that there is in line with the needs of 1 km. [code=java] Dis = Math.sqrt (Math.pow (hotel.getX) - 17704, 2) + Math.pow (hotel.getY () - 25787, 2); If (DIS < = 1000) { System.out.println ("*"); } [/code] It seems that there should be a problem when grouping iteration, [/quote] Read the letter [/quote] Sorry, the test data copy is wrong.

U013489254U01348925406:34 03-05
Grade T1 The 35 floor

The output of the latitude and longitude into key can effectively reduce the burden of reducer Two points on the map is less than 1 km distance, then the value of certain latitude and longitude in a certain precision range of the same (for example, a decimal point) is actually fuzzy take a square to reduce the amount of computation For example, a ID latitude and longitude 37.7833 degrees N, 122.4167 degrees mapper key W output is "N0037.7, W0122.4" or "N0037.7" + two key "W0122.4"" So that each reducer the amount of data is very small, directly determine the range of the hotel and the landmark distance and then the output of qualified pair or map on the line

U013884523U01388452308:59 03-05
Grade T1 The 36 floor

[quote= quoted 35 floor u013489254 reply: " The output of the latitude and longitude into key can effectively reduce the burden of reducer Two points on the map is less than 1 km distance, then the value of certain latitude and longitude in a certain precision range of the same (for example, a decimal point) is actually fuzzy take a square to reduce the amount of computation For example, a ID latitude and longitude 37.7833 degrees N, 122.4167 degrees mapper key W output is "N0037.7, W0122.4" or "N0037.7" + two key "W0122.4"" So that each reducer has a small amount of data is very small, directly determine the range of the hotel and the landmark distance and then the output of qualified pair or map on the line [/quote] Too idealistic! Assume that there are 38.0001w 37.9999n's hotel and 37.9999w 38.0001n's landmark Distance within one kilometer Press your make hotel to become 38, 38 Landmark turns 37.99, 37.99 That's not the way to get them out into a group. And took all the answers. In addition to the 8 floor and moderator of the program, others are less reliable

U013489254U01348925409:37 03-05
Grade T1 The 37 floor

[quote= quoted 36 floor u013884523 reply: " Too idealistic! Assume that there are 38.0001w 37.9999n's hotel and 37.9999w 38.0001n's landmark Distance within one kilometer Press your make hotel to become 38, 38 Landmark turns 37.99, 37.99 That's not the way to get them out into a group. [/quote] Probably I did not describe clearly, the distance N can affect the latitude and longitude of the number of K, then the K bit before the latitude and longitude data in five to four homes in the future will not change

U010611517U01061151711:38 03-05
Grade T1 The 38 floor

Now a lot of mobile phone software is used in the vicinity of the search function, but the specific is how to achieve it On the Internet to find a lot of information, the MySQL database, rectangle algorithm, geohash I have used over the when the data on the millions after MySQL spatial database is the most accurate (query before 100 data only about 5 seconds). Next, the introduction of an original calculation method, the query speed is 2 times the MySQL spatial database algorithm $lng is your longitude, $lat is your latitude LNG lat, SELECT, (POWER (MOD (LNG - $lng), 360 (), 2) + POWER (ABS ($lat - LAT), 2) distance AS `user_location` FROM BY distance LIMIT ORDER 100 After testing, in the 1 million data to take the first 100 data only about 2.5 seconds. #################################### The other algorithms are shown here: Distance form algorithm Define (EARTH_RADIUS, 6371); / / the radius of the earth, the average radius of 6371km / * * * calculate the four point of a square around a certain latitude and longitude. * LNG float *@param longitude Lat float *@param latitude *@param distance float the point where the radius of the circle, the circle and the inscribed square, the default value is 0.5 kilometers The latitude and longitude coordinates of the four points of the array *@return square * / ReturnSquarePoint function ($lng, $lat, $distance = 0.5) { $dlng = 2 * asin (sin ($distance / (2 * EARTH_RADIUS)) / cos (deg2rad ($lat)) (); $dlng = rad2deg ($dlng); $dlat = $distance/EARTH_RADIUS; $dlat = rad2deg ($dlat); Array return ( 'left-top'=>array ('lat'=>$lat + $dlat,'lng'=>$lng-$dlng), 'right-top'=>array ('lat'=>$lat +'lng'=>$lng, $dlat + $dlng), 'left-bottom'=>array ($dlat -'lng'=>$lng, $dlng -'lat'=>$lat), 'right-bottom'=>array ($dlat -'lng'=>$lng,'lat'=>$lat + $dlng) ); } / / use this function calculated results, into SQL query. $squares = returnSquarePoint ($lng, $lat); $info_sql SQL = "select id, locateinfo, lat, LNG from `lbs_info` `lbs_info` where lat<>0 and lat>{$squares['right-bottom']['lat']} and lat<{$squares['left-top']['lat']} and lng>{$squares['left-top']['lng']} and lng<{$squares['right-bottom']['lng']}"; Two, spatial database algorithm The following location field is the spatial data generated with the latitude and longitude, such as: The location field of the type is set to point "Feed set location=GEOMFROMTEXT update ('point ({$lng} {$lat})") id='{$id}'where" MySQL spatial data query @center SET = GEOMFROMTEXT ('POINT (35.801559 -10.501577)); @radius SET = 4000; @bbox SET = CONCAT ('POLYGON (",", X (@center) - @radius, '', Y (@center) - @radius, '', '', X (@center) + @radius, '', Y (@center) - @radius, '', '', X (@center) + @radius, '', Y (@center) + @radius, '', '', X (@center) - @radius, '', Y (@center) + @radius, '', '', X (@center) - @radius, '', Y (@center) - @radius, ')' ); ID LNG, SELECT, lat, SQRT (POW (X (location) - X (@center), 2) + POW (ABS (location) - Y (Y) - ABS (@center), 2) distance AS `user_location` WHERE 1=1 FROM INTERSECTS AND (location, GEOMFROMTEXT (@bbox)) SQRT AND (ABS (X (location) - X (@center), 2) + POW (ABS (Y (location) - Y (@center)), 2) () < @radius () (POW () () () () () BY distance LIMIT ORDER 20 Three, geo algorithm

U013436198U01343619812:40 03-05
Grade T1 The 39 floor

With the moderator of the code to measure, no problem [quote= quoted 37 floor u013489254 reply: " [quote= quoted 36 floor u013884523 reply: " Too idealistic! Assume that there are 38.0001w 37.9999n's hotel and 37.9999w 38.0001n's landmark Distance within one kilometer Press your make hotel to become 38, 38 Landmark turns 37.99, 37.99 That's not the way to get them out into a group. [/quote] Probably I did not describe clearly, with a distance of N can affect the latitude and longitude of the number of K, then the K bit before the latitude and longitude data in the four homes five into the future is not changing [/quote] 37 floor plan still can not solve the problems raised by the 36 floor

Zhangkai08111Zhangkai0811115:59 03-05
Grade T1 The 40 floor

100W put the memory is no problem and do not require the output format.. The question is not used to reduce 100W will put the memory before map, and then determine the filter output at map end one by one... If you want to format data, then come to a reduce. According to the hotel group, the output desired format..

U013489254U01348925416:21 03-05
Grade T1 The 41 floor

[quote= quoted 39 floor u013436198 reply: " 37 floor plan still can not solve the problems raised by the 36 floor [/quote] Four homes five into the accuracy of the why can not solve the problem? Set the accuracy of K, mathematics can not find a meet the conditions of the K so that as long as five to four into the K after the latitude and longitude of a certain gap between the actual distance of more than 1km?

U012677351U01267735115:27 03-06
Grade T1 The 42 floor

Hello, everyone.

Aby913Aby91312:17 03-07
Grade T1 The 43 floor

Are master ah come in to learn

Kissmelove01Kissmelove0113:24 03-08
Grade T1 The 44 floor

Thank you very much for a few days, this busy interview, did not see.

Java8964Java896406:08 03-09
Grade T1 The 45 floor

Kind, of busy in the last Sorry 2 weeks. 我八楼的算法是不对的,当时考虑的不仔细。如果一个地标在超过一个以上的酒店1公里内,原来的方法无法处理,就像23楼的问题,分割做不了,作为一项关键只能发送到一个分区,所以在上面的例子中,它是不可能把这一地标2减速器。 这个问题是一个真正的KNN连接问题(K近邻加入)。如果双方都是大的,它会找到一个最好的算法很难处理。但要记住这是一个面试问题。如果我在面试的基础上提出问题,这将是我的答案: 只有一百万家酒店坐标数据集,使用2倍来表示它将只需要2×8字节×1000000,大约1600万的记忆,作为坐标只是2双类型的值。我将所有的酒店坐标使用Hadoop分布式缓存发送给所有的映射。 该程序只会处理所有的地标数据作为输入,再加上分发酒店缓存数据,所以你可以忽略酒店的ID作为输出的关键,而地标数据如果地标在1公里内为价值。在减速,你得到所有的这家酒店的地标。 我相信这是一个简单的问题,因为最关键的是,我们只有1酒店。当然,如果我们增加数10亿,然后做同样的KNN连接解决方案。:-)

Java8964java896403-09 06:25
等级 T1 46楼

加一句,Hadoop版现在比较冷清(虽然大家都知道BigData热门),不是没人,而是好问题真的很少见。

Coolbamboo2008coolbamboo200803-09 17:10
等级 T1 47楼

好题目,帮楼主顶了!

Tntzbzctntzbzc03-09 22:34
等级 T1 48楼

[报价=引用44楼kissmelove01的回复:] 感谢版主,这几天忙着面试,没怎么来看[引用]。 我代码有个小bug,修正了 [代码] hotelpartitioner扩展分区= Java类<<酒店,nullwritable > { public int getpartition(酒店钥匙,nullwritable值,int numpartitions){ 返回数学。ABS((int)((关键。getxa() +键。getya())% numpartitions)); }; } [ /码] 你的问题很给力,祝你面试成功,如果遇到好的题目,一定要发哦,3Q

Sunxing007sunxing00703-10 15:22
等级 T1 49楼

我有一个注意,包行。MongoDB支持地理位置,且支持地理位置索引,MongoDB本身支持切分/ mapraduce,问题应该很简化。[ IMG = HTTP:/ / prog3。COM /深圳/论坛/ pointforum /用户界面/脚本/编程/插件/ 001 /面/ 13。GIF ] [下]

U011828076u01182807603-11 17:45
等级 T1 50楼

【img = HTTP:/ / prog3。COM /深圳/论坛/ pointforum /用户界面/脚本/编程/插件/ 003 /猴子/ 31。GIF ] [下]

Xiaoxiangqingxiaoxiangqing03-12 09:15
等级 T1 51楼

高难度,期待高人出现

U014798162u01479816204-29 14:19
等级 T1 52楼

且学且珍惜,得认学习真正消化

Sandj_13_14sandj_13_1405-07 15:54
等级 T1 53楼

请问大湿,你的暴力算法1,时间复杂度怎么计算,谢谢求详解!

Xlk23xlk2305-09 14:57
等级 T1 54楼

我也想知道,大湿的代码,时间复杂度是多少?

Liyang417800liyang41780006-03 16:16
等级 T1 55楼

长见识了~ ~ ~

He36363636he3636363608-22 16:00
等级 T1 56楼

学习·····

Xxzzpp123456xxzzpp12345611-05 18:07
等级 T1 57楼

酒店坐标为(x0,y0),坐标为(x1,y1),则(x1-x0)平方+(y1-y0)平方<= 1公里,这是函数必须要满足的不等式 1。同一个文件中找出所有的酒店坐标 2。两层循环,第一层循环为所有的非酒店坐标,第二层循环为所有的酒店坐标,满足不等式条件记录到对应酒店对应的关键的集合中

Lessonnairlessonnair11-12 11:54
等级 T1 58楼

可以先对文件进行分割,分为两部分,酒店部分和地标部分;酒店部分大小大概为10m左右。将酒店部分存入内存,对地标部分做地图,一次地图(地标ID,地标坐标)-------- >得到多个(酒店ID,地标ID)(关键点:地图过程中酒店数据全部存在内存里,地标数据存在文本中),然后再减少得到(酒店ID,多个地标ID)

U012523116u01252311612-11 10:49
等级 T1 59楼

Is God class, I this just out of school to see the right? I am very interested in ah, but Hadoop are not getting started ah

ZccaogongZccaogong12:45 03-12
Grade T1 The 60 floor

Good learning is to analyze and discuss! In research!

Xq_yfXq_yf14:51 03-18
Grade T1 The 61 floor

Dig down the grave, and in front of the method almost, through the elimination of XY is impossible, plus one: take to determine the scope of the final calculation is not sure. [img=http://prog3.com/sbdm/img.bbs/upload/201503/18/1426661225_45557.jpg][/img]

U012243846U01224384611:30 03-31
Grade T1 The 62 floor

[img=http://prog3.com/sbdm/forum/PointForum/ui/scripts/csdn/Plugin/003/monkey/2.gif][/img] Is God, Study hard white..

Baolibin528Baolibin52821:09 04-05
Grade T1 The 63 floor

[img=http://prog3.com/sbdm/forum/PointForum/ui/scripts/csdn/Plugin/003/monkey/20.gif][/img]

Liyang0410Liyang041010:20 06-01
Grade T1 The 64 floor

[img=http://prog3.com/sbdm/forum/PointForum/ui/scripts/csdn/Plugin/003/onion/71.gif][/img] I'm here to learn, and recently I've got a MapReduce,

U013457065U01345706518:52 07-02
Grade T1 The 65 floor

This is not at all on one level..

U011252761U01125276123:54 07-14
Grade T1 The 66 floor

Say what I think, first calculate the latitude and longitude of the hotel within one kilometer of the range, and then to find out all the landmarks. So the amount of calculation will be greatly reduced.

JsrlzJsrlz11:37 07-17
Grade T1 The 67 floor

Are master ah come in to learn

U013047426U01304742611:56 07-28
Grade T1 The 68 floor

In the case of no region of the first sub region analysis of the significance of the reality The hotel ID is stored in a regional database ID the longitude of a data array of the latitude of a data array. ID is divided into 100 levels (in accordance with the size of the latitude and longitude) The original data in accordance with latitude and longitude plus one kilometer data search for the corresponding level of latitude to search the corresponding level of latitude Take intersection You can

U013047426U01304742612:00 07-28
Grade T1 The 69 floor

Can also be used with 2 KEY a HashMap is the latitude of a KEY is the longitude to find the corresponding interval to take the intersection

One Zero Reply share
Img
Take away
ImgEven a small step
Also want to share with you
open
Img