Wednesday, January 13, 2010

duplicate entries in search results

A listener pointed out the duplicate entries which were being thrown up in the search results, which we knew about, but had not yet fixed. The initial entries to the files table had white-space before the filename, this was making them "distinct" from the newer entries of the same files without the whitespace. Trying to find them using this technique,
select c1.fileid, c2.fileid, c1.filename
from `radiosai_file_master` c1, `radiosai_file_master` c2
where c1.fileid < filename =" c2.filename
did not work, since the filenames were actually not exactly the same: this was throwing up filenames which had .mp3 and .MP3 extensions. Then tried with a wildcard before the filename, found it matched. Then, making queries of the form
 SELECT *  FROM `radiosai_file_master` WHERE
`fileName` LIKE CONVERT(_utf8 '__SPECIAL%' USING latin1) COLLATE latin1_swedish_ci
(note the two underscores - meaning two characters) found that there are two white-space chars before the filename. CR and LF is my guess. Couldn't find the way to properly search with an SQL Like statement for "Like 'CR+LF+%', since
 SELECT *  FROM `radiosai_file_master` WHERE
`fileName` LIKE CONVERT(_utf8 '__%' USING latin1) COLLATE latin1_swedish_ci
returned
SELECT *  FROM `radiosai_file_master` WHERE 1
(ha ha!). So, did individual searches for __BV%, __TALK% and so on and deleted those files. Hopefully all are covered now.

No comments:

Post a Comment