# 探索台灣郵局郵遞區號資料庫 # Exploring the Taiwan post code database # Copyright : http://www.fsf.org/copyleft/gpl.html # Author : 積丹尼 Dan Jacobson -- http://jidanni.org/ # Created On : Oct 2006 # Last Modified On: Fri Sep 17 02:58:04 2021 # Update Count : 59 See #2018 update at bottom. 今已六碼區號, 我未再查。 (Now codes are six digits long. I didn't check again.) [See http://en.wikipedia.org/wiki/Administrative_divisions_of_the_Republic_of_China for 2010/12/25 Taiwan administrative districts reorganization, not fully reflected in the text below.] 自郵局索回郵遞區號光碟,分析如下 Below is my examination of the free postal code CDROM available at Taiwan post offices. It goes beyond the free 500 page paper3+2 postal code book. Both are in Chinese. Also see http://www.post.gov.tw/ . Maybe the CDROM's contents can be downloaded. [2009/12: Yes, now http://www.post.gov.tw/ 's download area (下載區) is the only way to get these files, as they aren't making new CDROMs 已 不再製片 anymore.] 據光碟版本 The version of the CDROM I used was $ grep 'App Version' /cdrom/WEBSETUP/setup.stf App Version 2006-03-28 15:29:12 今試以應用 Let's solve some problems. 末段始提起載入過程。 How we loaded the data into MySQL is covered at the end. 敬告: 我懂得 MySQL 不多,以下則亂試。 Note: I am a MySQL beginner just guessing what to try. 友人求: Mark> Basically, what I want is a list of all the street names in Taiwan Mark> (Hanzi, Pinyin) and another list of all of the street names (again, with Mark> Hanzi and Pinyin) within individual cities/townships. Your latter request, mysql> SELECT DISTINCT CITY,AREA,ROAD,HROAD FROM rall1 ORDER BY ZIP3A,ROAD LIMIT 3; +-----------+-----------+-----------+------------+ | CITY | AREA | ROAD | HROAD | +-----------+-----------+-----------+------------+ | 台北市 | 文山區 | 一壽街 | Yishou St. | | 台北市 | 北投區 | 一德街 | Yide St. | | 台北市 | 北投區 | 一心路 | Yixin Rd. | +-----------+-----------+-----------+------------+ Your former request, mysql> SELECT DISTINCT ROAD,HROAD FROM rall1 ORDER BY ROAD,HROAD LIMIT 3; +-----------+-------------+ | ROAD | HROAD | +-----------+-------------+ | 一中街 | Yizhong St. | | 一坑 | Yi Valley | | 一坑路 | Yikeng Rd. | +-----------+-------------+ Notes: Special sorting field ZIP3A is e.g., 100 for ALL of 台北市. I didn't filter out non-streets like 一支局 1st Branch Post Office 漢語拼音 HROAD 一兩筆竟然遭「通用拼音」之污染! The Hanyu Pinyin HROAD field has a few ("Tongyong",feh!) pollution instances: mysql> SELECT DISTINCT CITY,AREA,ROAD,HROAD FROM rall1 WHERE HROAD REGEXP 'wun[^aeiou]'; +-----------+-----------+--------------------+--------------------------+ | CITY | AREA | ROAD | HROAD | +-----------+-----------+--------------------+--------------------------+ | 新竹縣 | 新埔鎮 | 文山路亞東段 | Yadong Sec., Wunshan Rd. | +-----------+-----------+--------------------+--------------------------+ "Wunshan"? I'm gonna puke. Is that what his (Yu BQ) momma taught him? 隔音符號 And we note they use "-":西安路 Xi-an Rd. * 收音機說郝龍斌住台北市[fh]ulin[g]?路,聽不清楚。 The radio says Candidate Hao lives on [fh]ulin[g]? Rd. in Taipei. Didn't hear it too clear. Which road(s) could that be? mysql> SELECT DISTINCT ROAD,HROAD FROM rall1 WHERE CITY='台北市' AND HROAD REGEXP '^[fh]ulin'; +-----------+-----------+ | ROAD | HROAD | +-----------+-----------+ | 虎林街 | Hulin St. | | 福林路 | Fulin Rd. | +-----------+-----------+ Imagine trying to guess them via 筆劃 indexes! * The MySQL documents mention functions A SOUNDS LIKE B, using SOUNDEX(), could we have used that? mysql> SELECT SOUNDEX('fulin'),SOUNDEX('fuling'),SOUNDEX('hulin'),SOUNDEX('huling')\G SOUNDEX('fulin'): F450 SOUNDEX('fuling'): F452 SOUNDEX('hulin'): H450 SOUNDEX('huling'): H452 mysql> SELECT SOUNDEX('fulin'), SOUNDEX('fulinn')\G SOUNDEX('fulin'): F450 SOUNDEX('fulinn'): F450 So it seems more tuned to English than Hanyu Pinyin. I'll stick to REGEXP. * 此路口屬何地? We overhear the bad guys are supposed to meet at the corner of 中正東路 and 仁德街, but what city? mysql> SELECT r1.ZIPCODE,r1.AREA FROM rall1 AS r1, rall1 AS r2 -> WHERE r1.ZIPCODE=r2.ZIPCODE AND r1.ROAD='中正東路' AND r2.ROAD='仁德街'; +---------+-----------+ | ZIPCODE | AREA | +---------+-----------+ | 30241 | 竹北市 | +---------+-----------+ 可惜地圖業者不加註郵遞區號 To bad map publishers don't add postcodes. 見信封再見地圖,免涉電腦。 One could look at an envelope, and then a map, no computer needed. * 此押花店靠哪裡? The pressed flower shop is at 台中市西區忠仁街115號, what better known roads is that near? mysql> SELECT DISTINCT ROAD FROM rall1 WHERE ZIPCODE= -> (SELECT ZIPCODE FROM rall1 WHERE CITY='台中市' and ROAD='忠仁街') LIMIT 3; +-----------+ | ROAD | +-----------+ | 五權路 | | 中山路 | | 中正路 | +-----------+ or can use SELECT DISTINCT r1.ROAD FROM rall1 as r1, rall1 as r2 WHERE r1.ZIPCODE=r2.ZIPCODE AND r1.CITY='台中市' and r2.ROAD='忠仁街'; (As I am a MySQL beginner, I just experiment.) * 何路名最長? What are the longest street names? mysql> SELECT DISTINCT AREA,ROAD,HROAD FROM rall1 ORDER BY CHARACTER_LENGTH(ROAD) DESC LIMIT 3; +-----------+-----------------------------------+----------------------------------------+ | AREA | ROAD | HROAD | +-----------+-----------------------------------+----------------------------------------+ | 太平市 | 新平路3段部仔二十五巷 | Buzai 25th Lane, Sec. 3, Xinping Rd. | | 信義鄉 | 玉山國家公園塔塔加旅遊 | TatajiaTour, Yushang National Park | | 芳苑鄉 | 建平村二溪路草2段 | Cao Sec. 2, Erxi Rd., Jianping Village | +-----------+-----------------------------------+----------------------------------------+ * 些短名 Some short names, mysql> SELECT DISTINCT ROAD FROM rall1 WHERE ROAD REGEXP '^...(街|路)$' ORDER BY ROAD LIMIT 3; +--------+ | ROAD | +--------+ | 下街 | | 下路 | | 丹路 | +--------+ We see three characters, "...", represent one Chinese UTF-8 character in REGEXP (but not in CHARACTER_LENGTH below.) * 何後綴最常見? What are the most common suffixes? mysql> SELECT DISTINCT RIGHT(ROAD,1) AS suffix, COUNT(*) AS num FROM rall1 -> WHERE CHARACTER_LENGTH(ROAD)>2 GROUP BY suffix ORDER BY num DESC LIMIT 4; +--------+-------+ | suffix | num | +--------+-------+ | 路 | 22521 | | 街 | 16703 | | 段 | 7238 | | 巷 | 4459 | +--------+-------+ * 含大道之鄉鎮市 Towns with boulevards: mysql> SELECT COUNT(DISTINCT AREA1) FROM rall1 WHERE ROAD REGEXP '.大道'\G COUNT(DISTINCT AREA1): 11 Many override Hanyu Pinyin with e.g., "Ketagalan". * 長之頻 Frequency of lengths, mysql> SELECT DISTINCT CHARACTER_LENGTH(ROAD) AS len,COUNT(*) FROM rall1 GROUP BY len; +-----+----------+ | len | count(*) | +-----+----------+ | 2 | 2295 | | 3 | 35982 | | 4 | 10858 | | 5 | 6927 | | 6 | 2373 | | 7 | 212 | | 8 | 91 | | 9 | 20 | | 11 | 2 | +-----+----------+ (E.g., 3: 中正路, three characters long.)) * 非所有種類地名尋得 Not every kind of place name is indexed: mysql> SELECT COUNT(*) FROM rall1 where CITY='台北市' AND ROAD REGEXP '西門'; +----------+ | COUNT(*) | +----------+ | 0 | +----------+ 休想如 So forget about e.g., 西門町. 僅收正址 Only official addresses components are indexed. * 最常見路等等名為何? What are the most common road etc. names? mysql> SELECT DISTINCT LEFT(ROAD,2) AS name,COUNT(*) FROM rall1 GROUP BY name ORDER BY COUNT(*) DESC LIMIT 6; +--------+----------+ | name | COUNT(*) | +--------+----------+ | 中山 | 1297 | | 中正 | 1067 | | 中華 | 549 | | 民生 | 526 | | 中興 | 481 | | 忠孝 | 456 | +--------+----------+ * 同市鄉鎮有何異路同拼? In the same municipality, what different character roads have the same Hanyu Pinyin? mysql> SELECT (COUNT(DISTINCT AREA1,ROAD,HROAD))AS UniqRoad, COUNT(DISTINCT AREA1,HROAD)AS UniqPin FROM rall1\G UniqRoad: 38649 UniqPin: 38494 mysql> SELECT 38649 - 38494\G 38649 - 38494: 155 (We show zips for multi zip cities. Program ./u is at bottom of page.) $ echo "SELECT DISTINCT AREA1,HROAD,ROAD,ZIP3A,SUBSTRING(ZIPCODE,1,3) FROM rall1 ORDER BY ZIP3A,HROAD,ROAD;"|mysql -N|./u|tee /dev/tty|wc -l 台北市 Jinghua St. 景化街 景華街 103 116 台北市 Tong-an St. 同安街 通安街 100 106 台北市 Wanqing St. 萬慶街 萬青街 116 108 台北市 Xiangyang Rd. 向陽路 襄陽路 115 100 台北市 Xingzhong Rd. 興中路 行忠路 115 114 貢寮鄉 Fuxing St. 復興街 福興街 土城市 Guangming St. 光明街 廣明街 三重市 Fuhua St. 富華街 福華街 三重市 Renhua St. 仁化街 仁華街 頭城鎮 Minfeng Rd. 民峰路 民鋒路 礁溪鄉 Malin Rd. 瑪僯路 瑪璘路 冬山鄉 Fuxing Rd. 復興路 福興路 冬山鄉 Jiadong Rd. 茄冬路 茄苳路 蘇澳鎮 Datong Rd. 大同路 大通路 蘇澳鎮 Shenzhou Rd. 神州路 神洲路 新竹市 Yumin St. 育民街 裕民街 關西鎮 Jinshan Li 金山里 錦山里 芎林鄉 Wenhua St. 文化街 文華街 中壢市 Fuhua St. 復華街 福華街 中壢市 Xingfu St. 幸福街 興福街 中壢市 Zhongxin Rd. 中新路 忠信路 中壢市 Zhongyuan Rd. 中原路 中園路 龍潭鄉 Longyuan Rd. 龍元路 龍源路 桃園市 Guangwen St. 光文街 廣文街 龜山鄉 Hou St. 后街 後街 龜山鄉 Wenshan St. 文善街 文山街 苗栗市 Wenfu 文富 文福 苗栗市 Zhong-ai 中愛 忠愛 造橋鄉 Shuangheshan 雙合山 雙鶴山 頭屋鄉 Fanzailiao 番仔寮 蕃仔寮 銅鑼鄉 Zhongxing St. 中興街 忠興街 台中市 Dafu St. 大富街 大福街 408 408 台中市 Datong St. 大同街 大通街 403 408 台中市 Daxin St. 大信街 大新街 407 403 台中市 Furen St. 福人街 福仁街 403 401 台中市 Futai St. 富台街 福泰街 401 407 台中市 Fuxin St. 復新街 福新街 402 402 台中市 Guangfu Rd. 光復路 廣福路 400 407 台中市 Jianxing Rd. 健行路 建興路 403 408 太平市 Fuxing St. 復興街 福興街 太平市 Yixin St. 宜信街 宜欣街 大里市 Daming Rd. 大明路 達明路 大里市 Lixin St. 立信街 立新街 外埔鄉 Shanjiao Lane 山腳巷 山角巷 大安鄉 Xizhou Rd. 溪州路 溪洲路 鹿港鎮 Dongshi Lane 東勢巷 東石巷 鹿港鎮 Xinsheng St. 新生街 新盛街 鹿港鎮 Yangcuo Lane 楊厝巷 洋厝巷 鹿港鎮 Yumin St. 育民街 裕民街 福興鄉 Fuxing Rd. 復興路 福興路 福興鄉 Xianian St. 下粘街 廈粘街 和美鎮 Jiadian Rd. 加佃路 嘉佃路 和美鎮 Zhongshan Rd. 中山路 忠善路 永靖鄉 Zhuxi Lane 竹溪巷 竹西巷 溪湖鎮 Fuli St. 富麗街 福利街 田中鎮 Xinsheng St. 新生街 新聖街 田尾鄉 Fengtian Rd. 豐田路 鳳田路 埤頭鄉 Jian-an St. 健安街 建安街 溪州鄉 Xidou Rd. 溪斗路 西斗路 芳苑鄉 Chuanhe Rd. 川合路 川和路 二水鄉 Wubo Lane 五伯巷 伍伯巷 埔里鎮 Fuxing Rd. 復興路 福興路 嘉義市 Datong Rd. 大同路 大統路 嘉義市 Xinxing St. 信興街 新興街 竹崎鄉 Fujin 復金 覆金 竹崎鄉 Juezai 堀仔 掘仔 新港鄉 Xinbuluo 新步落 新部落 斗六市 Fuxing Rd. 復興路 福興路 斗六市 Wuchang St. 五常街 武昌街 斗六市 Xinghe Rd. 興合路 興和路 台南市 Anyi Rd. 安億路 安義路 708 709 台南市 Fude St. 富德街 福德街 701 704 台南市 Jing-an St. 敬安街 景安街 709 709 台南市 Yude 1st St. 育德一街 裕德一街 704 701 台南市 Yude 3rd St. 育德三街 裕德三街 704 701 台南市 Yude Rd. 育德路 裕德路 704 701 台南市 Yuping 1st St. 育平一街 裕平一街 708 701 台南市 Yuping Rd. 育平路 裕平路 708 701 台南市 Yuying St. 育英街 裕英街 709 701 永康市 Xinxing St. 新興街 新行街 永康市 Yongzhong St. 永中街 永忠街 將軍鄉 Kunshen Village 鯤身村 鯤鯓村 高雄市 Changsheng St. 常盛街 長生街 811 801 高雄市 Chongjing St. 崇敬街 重景街 812 813 高雄市 Dade St. 大德街 達德街 807 812 高雄市 Dali St. 大利街 大禮街 807 803 高雄市 Dewei St. 德威街 德維街 813 811 高雄市 Fuxing St. 復興街 福興街 800 811 高雄市 Haichengwaihai Rd. 海城外海路 海澄外海路 812 812 高雄市 Hongzhi St. 宏志街 宏智街 812 812 高雄市 Jinfu Rd. 金府路 金福路 812 806 高雄市 Liming Rd. 禮明路 立明路 807 813 高雄市 Liren St. 立人街 立仁街 807 811 高雄市 Lizhong Rd. 立中路 立忠路 813 807 高雄市 Nansheng St. 南聖街 楠盛街 806 811 高雄市 Qingfeng St. 慶豐街 清風街 青峰街 804 807 804 高雄市 Qingyun St. 慶雲街 青雲街 807 804 高雄市 Ruiping Rd. 瑞屏路 瑞平路 811 811 高雄市 Shangzhi St. 尚志街 尚智街 802 800 高雄市 Songfu St. 松富街 松福街 812 812 高雄市 Tai-an St. 台安街 泰安街 807 807 高雄市 Wenzhi Rd. 文智路 文直路 813 813 高雄市 Xiling St. 西凌街 西陵街 813 813 高雄市 Xinchang St. 信昌街 新昌街 811 811 高雄市 Xinghua St. 興化街 興華街 806 803 高雄市 Yide Rd. 一德路 義德路 806 807 高雄市 Yili St. 伊犁街 義利街 804 807 高雄市 Yiyong Rd. 義勇路 義永路 802 807 高雄市 Yongchang St. 勇昌街 永昌街 811 802 高雄市 Yongfu St. 永富街 永福街 802 802 高雄市 Yufeng St. 玉豐街 裕豐街 812 804 高雄市 Zhengde Rd. 政德路 正德路 813 804 高雄市 Zhongxin Rd. 中信路 中心路 忠信路 813 812 813 仁武鄉 Chengxin St. 澄信街 澄新街 仁武鄉 Fude St. 富德街 福德街 仁武鄉 Renxin Lane 仁信巷 仁新巷 大社鄉 Wenhua Rd. 文化路 文華路 岡山鎮 Dacheng St. 大埕街 大成街 岡山鎮 Wanli Rd. 灣裡路 灣里路 橋頭鄉 Shilong N. Lane 仕隆北巷 仕龍北巷 梓官鄉 Tong-an Rd. 同安路 通安路 鳳山市 Nanfu St. 南富街 南福街 大寮鄉 Fulong St. 富隆街 福隆街 林園鄉 Xizhou 3rd Rd. 溪州三路 溪洲三路 旗山鎮 Fuxing Lane 富興巷 復興巷 美濃鎮 Guangfu St. 光復街 廣福街 金沙鎮 Yangzhai 陽宅 陽翟 金湖鎮 Houlong 后壟 後壟 金湖鎮 Xiazhuangzhongxing Rd. 下庄中興路 下莊中興路 屏東市 Gongyi St. 公益街 公義街 屏東市 Guangxin Lane 光信巷 光心巷 里港鄉 Zhongxiao Rd. 中校路 忠孝路 高樹鄉 Guangfu Rd. 光復路 廣福路 竹田鄉 Xinshi Rd. 信實路 新勢路 內埔鄉 Jianfu Rd. 建富路 建復路 內埔鄉 Lidong Rd. 里東路 黎東路 內埔鄉 Lixin Rd. 里信路 里新路 內埔鄉 Renhe Rd. 人和路 仁和路 內埔鄉 Xindong Rd. 信東路 新東路 內埔鄉 Xinsheng Rd. 新勝路 新生路 內埔鄉 Zili Rd. 自力路 自立路 萬丹鄉 Xinfu Rd. 新富路 新福路 潮州鎮 Fuxing Rd. 復興路 福星路 萬巒鄉 Fuxing Rd. 富興路 復興路 林邊鄉 Tifang Rd. 堤防路 提防路 恆春鎮 Eluan Rd. 鵝鑾路 鵝鸞路 恆春鎮 Kengnei Rd., Eluan Li 鵝鑾里坑內路 鵝鸞里坑內路 滿州鄉 Fuxing Rd. 復興路 福興路 關山鎮 Zhongqing Rd. 中慶路 忠慶路 吉安鄉 Fuxing 1st St. 復興一街 福興一街 吉安鄉 Fuxing 2nd St. 復興二街 福興二街 吉安鄉 Fuxing 3rd St. 復興三街 福興三街 吉安鄉 Fuxing 4th St. 復興四街 福興四街 153 153 vs. 155... Oh, those are the triples of 高雄市 above: Qingfeng St., Zhongxin Rd. * 五碼得辨之? Do the 5 digit zipcode differentiate them? CREATE TABLE t1 (INDEX(HROAD,AREA1)) TYPE=MEMORY SELECT DISTINCT AREA1,HROAD,ROAD,ZIPCODE FROM rall1; SELECT a.ZIPCODE,a.AREA1,a.HROAD,a.ROAD,b.ROAD FROM t1 AS a JOIN t1 AS b USING(AREA1,HROAD,ZIPCODE) WHERE a.ROAD != b.ROAD GROUP BY HROAD ORDER BY a.ZIPCODE,a.HROAD LIMIT 4; #misses third items though DROP TABLE t1; ZIPCODE AREA1 HROAD ROAD ROAD 26144 頭城鎮 Minfeng Rd. 民峰路 民鋒路 26245 礁溪鄉 Malin Rd. 瑪僯路 瑪璘路 27043 蘇澳鎮 Shenzhou Rd. 神州路 神洲路 30073 新竹市 Yumin St. 育民街 裕民街 可能多別字 Hmmm, maybe most are just alternate characters... Indeed, the closer the zipcodes and tones match, the more likely that they are really talking about the same street. Note, 茲不歸重疊 e.g., these we don't consider duplicates: Zhongzheng St.: Zhongzheng Rd.; Fuxing N. Rd. 復興北路: Fuxing Rd. 福興路; Wuchang St. 五常街: Sec. 1, Wuchang St. 武昌街1段; Taizhong City has many Shanxi close calls, but using Shaanxi (two a's is) still not needed to differentiate. 台中市"(陝|陜|山)西"險些動到了 "Shanxi/Shaanxi". For example: mysql> SELECT DISTINCT HROAD,ROAD FROM rall1 WHERE HROAD REGEXP 'Shanxi .\. 1st St\.'; Shanxi N. 1st St. 山西北一街 Shanxi S. 1st St. 山西南一街 Shanxi E. 1st St. 陜西東一街 記得北部看過 Any of these? I swear I saw some on Northern Taiwan house plates: mysql> SELECT ROAD FROM rall1 WHERE ROAD REGEXP '衚|衕|衖|胡同'; Empty set (0.38 sec) Maybe 衖 becomes officially 弄. *** 附錄: 載入 Appendix: getting the data into MySQL etc. There is only 30+ megabytes of files on the CD, 95% empty. 均採 Debian GNU/Linux,我家不用微軟。 惟僅能見內臟,未能執行其光碟程式。 I use Debian GNU/Linux and don't use/have any Microsoft software, hence I just looked at some files I found on and extracted from the CDROM and do not nor cannot see the CDROM programs as they were intended. I used the Debian package cabextract on WEBSETUP/SETUP1.CAB and also noticed WEBSETUP/setup.inf had their file name list. (I moved the files to their proper names with a quick script but the must be a better way.) 載入 MySQL 後一切能以標準 SQL 語操作。 rall1.dbf looks interesting. We import it into MySQL, after which we can use standard SQL statements: $ dbf2mysql -U jidanni -P passwd -c rall1.dbf Example record: mysql> SELECT * FROM rall1 WHERE HROAD REGEXP 'Chongqing' AND SCOOP REGEXP '187'\G *************************** 1. row *************************** ZIPCODE1: ZIPCODE2: OFFICE: 板橋郵局投遞股-23區 ZIP3A: 220 ZIPCODE: 22063 CITY: 台北縣 (now 新北市 2010/12/25) AREA: 板橋市 AREA1: 板橋市 ROAD: 重慶路 SCOOP: 單 187號以下 EVEN: 1 CMP_LABLE: 6 LANE: 0 LANE1: 0 ALLEY: 0 ALLEY1: 0 NO_BGN: 1 NO_BGN1: 0 NO_END: 187 NO_END1: 0 FLOOR: 0 FLOOR1: 0 EXP: ROAD_NO: 452431 ROAD1: 重慶路 EROAD: Chongcing Rd. TROAD: Chongcing Rd. PROAD: Chungching Rd. HROAD: Chongqing Rd. RMK: 0001000000 RKEY: CCL TCODE: CCL HCODE: CQL PCODE: CCL MCODE: YVL ZIP3RMK: T ECITY: Taipei County EAREA: Banciao City ISN: F I am not sure if the alternative pinyins are accessible from the CDROM's user interface, or are just waiting in hiding in case of political changes. 發現資料中些特點 We note some special things in/about the data: mysql> SELECT COUNT(DISTINCT ZIP3A), COUNT(DISTINCT CITY), COUNT(DISTINCT AREA), COUNT(DISTINCT AREA1), -> COUNT(DISTINCT ZIP3A,CITY,AREA,AREA1) FROM rall1\G COUNT(DISTINCT ZIP3A): 329 COUNT(DISTINCT CITY): 26 COUNT(DISTINCT AREA): 359 COUNT(DISTINCT AREA1): 328 COUNT(DISTINCT ZIP3A,CITY,AREA,AREA1): 369 mysql> SELECT DISTINCT CITY,AREA,AREA1 FROM rall1 WHERE AREA=''; +-----------+------+-----------+ | CITY | AREA | AREA1 | +-----------+------+-----------+ | 新竹市 | | 新竹市 | | 嘉義市 | | 嘉義市 | +-----------+------+-----------+ $ echo "SELECT DISTINCT ZIP3A,AREA1 FROM rall1;"|mysql -N|cut -f 2|sort|uniq -d 高雄市 mysql> SELECT COUNT(DISTINCT ROAD,ZIP3A,CITY,AREA,AREA1) FROM rall1 WHERE AREA1='高雄市' AND ZIP3A = 800\G COUNT(DISTINCT ROAD,ZIP3A,CITY,AREA,AREA1): 2187 mysql> SELECT COUNT(DISTINCT ROAD,ZIP3A,CITY,AREA,AREA1) FROM rall1 WHERE AREA1='高雄市' AND ZIP3A != 800\G COUNT(DISTINCT ROAD,ZIP3A,CITY,AREA,AREA1): 4 mysql> SELECT DISTINCT ROAD,ZIP3A,CITY,AREA,AREA1 FROM rall1 WHERE AREA1='高雄市' AND ZIP3A != 800; +--------------+-------+-----------+-----------+-----------+ | ROAD | ZIP3A | CITY | AREA | AREA1 | +--------------+-------+-----------+-----------+-----------+ | 中山二路 | 806 | 高雄市 | 前鎮區 | 高雄市 | | 管仲南路 | 806 | 高雄市 | 前鎮區 | 高雄市 | | 鄭和南路 | 806 | 高雄市 | 前鎮區 | 高雄市 | | 九如二路 | 807 | 高雄市 | 三民區 | 高雄市 | +--------------+-------+-----------+-----------+-----------+ 此四筆之 ZIP3A 恐誤, they should be 800. This will slightly disturb the sorting I gave Mark above. This explains the 328 vs. 329 discrepancy above. Now ran "gaoxiong_fix" below. Wait, there's still things like $ echo "SELECT DISTINCT AREA1,ZIP3A,SUBSTRING(ZIPCODE,1,3) AS zip3 FROM rall1 HAVING ZIP3A != zip3 AND (ZIP3A % 100 OR AREA1 REGEXP '區');"|mysql AREA1 ZIP3A zip3 新莊市 242 248 新市鄉 744 741 旗津區 800 817 mysql> SELECT DISTINCT EXP FROM rall1 WHERE EXP NOT REGEXP '大宗|^$|外造字' ORDER BY EXP DESC LIMIT 2; +-----------------+ | EXP | +-----------------+ | (蛇仔巃) | | 重疊?? | +-----------------+ Amongst other files found, dict.dbf repeats some items from rall1.dbf including non-ideal 第一橫街 1st Side St. Which I would call 1st St., but I live here (台中市東勢區). There are quite a few probably worthy files on the disk (after unzipping the .CAB file) but I mainly concentrated on the biggest, rall1.dbf. Parts of some other files seen: $ dbfdump -r cmp.dbf|head|perl -anwe 'print "$F[2]\t$F[1]\n"'|iconv -f big5 CODE CMPL 165 鄰內單一附號 166 鄰內起門牌號迄附號 167 鄰內起附號迄門牌號 168 鄰內起附號迄附號 153 巷弄內單一附號 154 巷弄內單號全 155 巷弄內雙號全 156 巷弄內起門牌號迄附號(連) 157 巷弄內起門牌號迄附號(單) (Must be rall1's CMP_LABLE field, (that I should just do ord() on below, if the table is ready for that...)) * 大宗郵件收受戶 large organizations with their own zipcode $ dbf_dump -fields NAME bigc.dbf|iconv -f big5|perl -C -lnwe '/..$/&&print $&'|sort|uniq -c|sort -nr|head 59 公司 47 大學 38 學院 22 醫院 19 政府 19 分局 10 徵處 8 大樓 8 中心 7 察局 mysql> SELECT COUNT(*) FROM rall1 WHERE EXP REGEXP '^大宗-'\G count(*): 416 聚何地? Where are those organizations concentrated? mysql> SELECT COUNT(*),AREA1 FROM rall1 WHERE EXP REGEXP '^大宗-' GROUP BY AREA1 ORDER BY COUNT(*) DESC LIMIT 4; +----------+-----------+ | COUNT(*) | AREA1 | +----------+-----------+ | 118 | 台北市 | | 26 | 台中市 | | 18 | 新莊市 | | 14 | 中和市 | +----------+-----------+ * 最多後綴 What are the most common suffixes? mysql> SELECT SUBSTRING(ZIPCODE,4,2) AS suffix,COUNT(*) FROM rall1 -> WHERE EXP REGEXP '^大宗-' GROUP BY suffix ORDER BY COUNT(*) DESC LIMIT 6; +--------+----------+ | suffix | COUNT(*) | +--------+----------+ | 01 | 72 | | 02 | 50 | | 03 | 46 | | 05 | 29 | | 04 | 27 | | 06 | 23 | +--------+----------+ * 其餘之? What are the most common suffixes of the rest? mysql> SELECT SUBSTRING(ZIPCODE,4,2) AS suffix,COUNT(*) FROM rall1 -> WHERE EXP NOT REGEXP '^大宗-' GROUP BY suffix ORDER BY COUNT(*) DESC LIMIT 5; +--------+----------+ | suffix | COUNT(*) | +--------+----------+ | 41 | 4848 | | 42 | 4241 | | 43 | 3934 | | 44 | 3357 | | 45 | 3099 | +--------+----------+ # 尋找同鄉鎮區同時存在某路無段及有段的, # 例如台中市北屯區崇德六路、崇德六路一段。 # 尋有百多例。有的或許為分段後的殘留。 $ cat Makefile N=mysql -N -r --default-character-set=binary test|iconv -f big5 Q=echo "SELECT DISTINCT city,area,road FROM test WHERE hroad $$n REGEXP '^Sec';"|$N tg:tx ts; grep -Fof tx ts|sort -u tx:;n=NOT;$Q > $@ ts:; $Q > $@ To find more tools, search for "dbf" and "xbase" matches on http://packages.debian.org/ . P.S., how I managed to import a UTF-8 version of rall1.dbf into mysql on Debian GNU/Linux. Makefile snippet: CDB:;echo CREATE DATABASE postoffice CHARACTER SET utf8\;|mysql tt44:rall1.dbf; dbf2mysql -r -d test -c -P $${PAS?} $? #big5 into test db cral:rall1.dbf; dbf2mysql -r -d postoffice -t rall1 -cc -P $${PAS?} $? #empty /tmp/tt55:; echo "SELECT * FROM test;"|mysql -N test \ --default-character-set=binary >$@ /tmp/rall1.txt:/tmp/tt55 perl -F\\t -anlwe \ '$$F[11]=chr(ord($$F[11])%128);print join "\t", @F' $?|\ iconv -f big5|tr -d \\\\ > $@ #big5 功 etc. doubled "\" #$F[11] is CMP_LABLE, which had some non UTF-8 characters that we alter first. tt66:; cd /tmp && mysqlimport --delete -v --local postoffice rall1.txt gaoxiong_fix:;echo "UPDATE rall1 SET ZIP3A=800 WHERE AREA1='高雄市' AND ZIP3A!=800;"|mysql Program u used above: $ cat u #!/usr/bin/perl use strict; use warnings; my ( @F, %count, %ROAD, %ZIP3 ); my $lastsamecheck = ""; while (<>) { chomp; @F = split "\t"; my $samecheck = "@F[0..2]"; my $key = "$F[3]\t$F[0]\t$F[1]"; next if $samecheck eq $lastsamecheck; $count{$key}++; $ROAD{$key} .= "\t$F[2]"; $ZIP3{$key} .= $F[3] !~ /[12478]00/ ? "" : "\t$F[4]"; $lastsamecheck = $samecheck; } for ( sort keys %count ) { if ( $count{$_} > 1 ) { my @o = split "\t", "$_$ROAD{$_}$ZIP3{$_}"; shift @o; print join( "\t", @o ), "\n"; } } Due to lack of SQL experience, I was unable to write a pure SQL version of the above. 針對混亂,參 「中国行政区划网」 #2018 update: On https://www.post.gov.tw/ 's 下載 (download) page we notice and download e.g., (smaller database, just for some all-Chinese road name experiments, not pinyin) $ wget -U X http://download.post.gov.tw/post/download/Zip32_10710.zip $ unzip Zip32_10710.zip $ ssconvert 3+2郵遞區號檔.xls po.csv sqlite3 po.sqlite < $@ $Z 'SELECT DISTINCT $O FROM p AS p1, p AS p2 WHERE p1.$c = p2.$c AND p1.$r = p2.$r AND p1.$d != p2.$d;' same_district_road_names:/tmp/sdrn < $? perl -Mutf8 -C -F\\\| -alnwe 'next if /釣魚臺|沙群島/; $(\ )push @{$$c{$$F[1].$$F[2]}}, $$F[0]; END{$(\ )for(keys %c){print scalar @{$$c{$$_}}, " $$_: ", join "、", sort @{$$c{$$_}};}};'|sort -k 1nr /tmp/sdrn: > $@ $Z 'SELECT DISTINCT $O FROM p AS p1, p AS p2 WHERE p1.$d = p2.$d AND p1.$r = p2.$r AND p1.$c != p2.$c;' longest_names:; $Z 'SELECT DISTINCT $c,$d,$r FROM p ORDER BY length($r) DESC LIMIT 33;'