Znajdowanie duplikatów wierszy w SQL Server

Mam bazę danych SQL Server organizacji i jest wiele duplikatów wierszy. Chcę uruchomić polecenie select, aby pobrać wszystkie te i ilość dupów, ale także zwrócić identyfikatory powiązane z każdą organizacją.

Wypowiedź typu:

SELECT     orgName, COUNT(*) AS dupes  
FROM         organizations  
GROUP BY orgName  
HAVING      (COUNT(*) > 1)

Zwróci coś w rodzaju

orgName        | dupes  
ABC Corp       | 7  
Foo Federation | 5  
Widget Company | 2 
Ale chciałbym też zabrać ich dokumenty. Jest na to jakiś sposób? Może jak
orgName        | dupeCount | id  
ABC Corp       | 1         | 34  
ABC Corp       | 2         | 5  
...  
Widget Company | 1         | 10  
Widget Company | 2         | 2  

Powodem jest to, że istnieje również oddzielna tabela użytkowników ten link do tych organizacji, i chciałbym je ujednolicić (dlatego usunąć dupes więc użytkownicy link do tej samej organizacji zamiast dupe orgs). Ale chciałbym część ręcznie, więc nie spieprzyć niczego, ale nadal potrzebuję oświadczenie zwracające identyfikatory wszystkich dupe orgs, więc mogę przejść przez listę użytkowników.

Author: ROMANIA_engineer, 2010-01-21

16 answers

select o.orgName, oc.dupeCount, o.id
from organizations o
inner join (
    SELECT orgName, COUNT(*) AS dupeCount
    FROM organizations
    GROUP BY orgName
    HAVING COUNT(*) > 1
) oc on o.orgName = oc.orgName
 288
Author: RedFilter,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2010-01-21 20:32:02

Możesz uruchomić następujące zapytanie i znaleźć duplikaty za pomocą max(id) i usunąć te wiersze.

SELECT orgName, COUNT(*), Max(ID) AS dupes 
FROM organizations 
GROUP BY orgName 
HAVING (COUNT(*) > 1)

Ale będziesz musiał uruchomić to zapytanie kilka razy.

 83
Author: Aykut Akıncı,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2014-05-22 11:01:18

Możesz to zrobić tak:

SELECT
    o.id, o.orgName, d.intCount
FROM (
     SELECT orgName, COUNT(*) as intCount
     FROM organizations
     GROUP BY orgName
     HAVING COUNT(*) > 1
) AS d
    INNER JOIN organizations o ON o.orgName = d.orgName

Jeśli chcesz zwrócić tylko te rekordy, które można usunąć (pozostawiając po jednym z nich), możesz użyć:

SELECT
    id, orgName
FROM (
     SELECT 
         orgName, id,
         ROW_NUMBER() OVER (PARTITION BY orgName ORDER BY id) AS intRow
     FROM organizations
) AS d
WHERE intRow != 1

Edit: SQL Server 2000 nie posiada funkcji ROW_NUMBER (). Zamiast tego możesz użyć:

SELECT
    o.id, o.orgName, d.intCount
FROM (
     SELECT orgName, COUNT(*) as intCount, MIN(id) AS minId
     FROM organizations
     GROUP BY orgName
     HAVING COUNT(*) > 1
) AS d
    INNER JOIN organizations o ON o.orgName = d.orgName
WHERE d.minId != o.id
 29
Author: Paul,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2010-01-21 21:48:45

Rozwiązanie oznaczone jako poprawne nie działa dla mnie, ale znalazłem odpowiedź, która działa po prostu świetnie: Pobierz listę duplikatów wierszy w MySql

SELECT n1.* 
FROM myTable n1
INNER JOIN myTable n2 
ON n2.repeatedCol = n1.repeatedCol
WHERE n1.id <> n2.id
 9
Author: ecairol,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2017-05-23 11:47:23

Możesz tego spróbować, tak będzie najlepiej dla ciebie

 WITH CTE AS
    (
    SELECT *,RN=ROW_NUMBER() OVER (PARTITION BY orgName ORDER BY orgName DESC) FROM organizations 
    )
    select * from CTE where RN>1
    go
 8
Author: code save,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2013-11-07 08:19:18
Select * from (Select orgName,id,
ROW_NUMBER() OVER(Partition By OrgName ORDER by id DESC) Rownum
From organizations )tbl Where Rownum>1

Więc rekordy z rowum> 1 będą zduplikowanymi rekordami w Twojej tabeli. "Partition by" najpierw grupuje płyty, a następnie serializuje je, nadając im numery seryjne. Więc rownum > 1 będzie duplikatem rekordów, które mogą być usunięte jako takie.

 3
Author: Mike Clark,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2015-03-10 05:58:02
select * from [Employees]

Do znajdowania zduplikowanego rekordu 1) Korzystanie z CTE

with mycte
as
(
select Name,EmailId,ROW_NUMBER() over(partition by Name,EmailId order by id) as Duplicate from [Employees]
)
select * from mycte

2) Za Pomocą GroupBy

select Name,EmailId,COUNT(name) as Duplicate from  [Employees] group by Name,EmailId 
 3
Author: Debendra Dash,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2017-03-09 05:22:22
select column_name, count(column_name)
from table_name
group by column_name
having count (column_name) > 1;

Src: https://stackoverflow.com/a/59242/1465252

 2
Author: iCrazybest,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2017-05-23 11:54:46
select a.orgName,b.duplicate, a.id
from organizations a
inner join (
    SELECT orgName, COUNT(*) AS duplicate
    FROM organizations
    GROUP BY orgName
    HAVING COUNT(*) > 1
) b on o.orgName = oc.orgName
group by a.orgName,a.id
 2
Author: user5336758,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2015-09-15 07:04:10

Jeśli chcesz usunąć duplikaty:

WITH CTE AS(
   SELECT orgName,id,
       RN = ROW_NUMBER()OVER(PARTITION BY orgName ORDER BY Id)
   FROM organizations
)
DELETE FROM CTE WHERE RN > 1
 2
Author: akd,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2016-06-17 09:51:06
select orgname, count(*) as dupes, id 
from organizations
where orgname in (
    select orgname
    from organizations
    group by orgname
    having (count(*) > 1)
)
group by orgname, id
 1
Author: Jordão,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2010-01-21 21:06:15

Masz kilka możliwości wyboru duplicate rows.

Dla moich rozwiązań, najpierw rozważ tę tabelę na przykład

CREATE TABLE #Employee
(
ID          INT,
FIRST_NAME  NVARCHAR(100),
LAST_NAME   NVARCHAR(300)
)

INSERT INTO #Employee VALUES ( 1, 'Ardalan', 'Shahgholi' );
INSERT INTO #Employee VALUES ( 2, 'name1', 'lname1' );
INSERT INTO #Employee VALUES ( 3, 'name2', 'lname2' );
INSERT INTO #Employee VALUES ( 2, 'name1', 'lname1' );
INSERT INTO #Employee VALUES ( 3, 'name2', 'lname2' );
INSERT INTO #Employee VALUES ( 4, 'name3', 'lname3' );

Pierwsze rozwiązanie:

SELECT DISTINCT *
FROM   #Employee;

WITH #DeleteEmployee AS (
                     SELECT ROW_NUMBER()
                            OVER(PARTITION BY ID, First_Name, Last_Name ORDER BY ID) AS
                            RNUM
                     FROM   #Employee
                 )

SELECT *
FROM   #DeleteEmployee
WHERE  RNUM > 1

SELECT DISTINCT *
FROM   #Employee

Rozwiązanie Secound : użyj pola identity

SELECT DISTINCT *
FROM   #Employee;

ALTER TABLE #Employee ADD UNIQ_ID INT IDENTITY(1, 1)

SELECT *
FROM   #Employee
WHERE  UNIQ_ID < (
    SELECT MAX(UNIQ_ID)
    FROM   #Employee a2
    WHERE  #Employee.ID = a2.ID
           AND #Employee.FIRST_NAME = a2.FIRST_NAME
           AND #Employee.LAST_NAME = a2.LAST_NAME
)

ALTER TABLE #Employee DROP COLUMN UNIQ_ID

SELECT DISTINCT *
FROM   #Employee

And end of all solution use this command

DROP TABLE #Employee
 1
Author: Ardalan Shahgholi,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2013-11-07 07:15:12

I think i know what you need musiałem wymieszać odpowiedzi i myślę, że mam rozwiązanie, które chciał:

select o.id,o.orgName, oc.dupeCount, oc.id,oc.orgName
from organizations o
inner join (
    SELECT MAX(id) as id, orgName, COUNT(*) AS dupeCount
    FROM organizations
    GROUP BY orgName
    HAVING COUNT(*) > 1
) oc on o.orgName = oc.orgName

Posiadanie max id da ci id dublicate i jeden z oryginału, o który prosił:

id org name , dublicate count (missing out in this case) 
id doublicate org name , doub count (missing out again because does not help in this case)

Tylko smutna rzecz, którą można zgasić w tej formie

id , name , dubid , name

Hope it still helps

 0
Author: Arthur Kielbasa,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2014-10-01 08:13:03

Załóżmy, że mamy tabelę 'Student' z 2 kolumnami:

  • student_id int
  • student_name varchar

    Records:
    +------------+---------------------+
    | student_id | student_name        |
    +------------+---------------------+
    |        101 | usman               |
    |        101 | usman               |
    |        101 | usman               |
    |        102 | usmanyaqoob         |
    |        103 | muhammadusmanyaqoob |
    |        103 | muhammadusmanyaqoob |
    +------------+---------------------+
    

Teraz chcemy zobaczyć duplikaty rekordów Użyj tego zapytania:

select student_name,student_id ,count(*) c from student group by student_id,student_name having c>1;

+---------------------+------------+---+
| student_name        | student_id | c |
+---------------------+------------+---+
| usman               |        101 | 3 |
| muhammadusmanyaqoob |        103 | 2 |
+---------------------+------------+---+
 0
Author: Usman Yaqoob,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2018-02-09 07:32:00

Mam lepszą opcję, aby uzyskać zduplikowane rekordy w tabeli

SELECT x.studid, y.stdname, y.dupecount
FROM student AS x INNER JOIN
(SELECT a.stdname, COUNT(*) AS dupecount
FROM student AS a INNER JOIN
studmisc AS b ON a.studid = b.studid
WHERE (a.studid LIKE '2018%') AND (b.studstatus = 4)
GROUP BY a.stdname
HAVING (COUNT(*) > 1)) AS y ON x.stdname = y.stdname INNER JOIN
studmisc AS z ON x.studid = z.studid
WHERE (x.studid LIKE '2018%') AND (z.studstatus = 4)
ORDER BY x.stdname

Wynik powyższego zapytania pokazuje wszystkie zduplikowane nazwy z unikalnymi identyfikatorami ucznia i liczbą zduplikowanych wystąpień

Kliknij tutaj, aby zobaczyć wynik sql

 0
Author: SoftIdea,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2018-06-25 09:40:30

Try

SELECT orgName, id, count(*) as dupes
FROM organizations
GROUP BY orgName, id
HAVING count(*) > 1;
 -2
Author: ryan,
Warning: date(): Invalid date.timezone value 'Europe/Kyiv', we selected the timezone 'UTC' for now. in /var/www/agent_stack/data/www/doraprojects.net/template/agent.layouts/content.php on line 54
2014-01-22 06:49:56