Tips and Tactics for Gathering DNA Data

Overview

Many users set out to gather everything possible on their first try, only to discover that a full gather for FTDNA, Ancestry, or MyHeritage may take hours, days, or even weeks. Planning in advance how much and which DNA match data to gather with the DNAGedcom Client (DGC) can save time and keep your research moving while more data is being gathered. Luckily, subsequent DGC gathers take far less time than the first one; getting past that first gather is the topic of this article. Gedmatch and 23andme gathers collect all matches automatically, but because their databases are smaller, complete gathers are generally quick.

Tip 1: If you are new to genetic genealogy or just plain impatient, set the cM range minimum to 30 cM when you collect Ancestry, FTDNA, or MyHeritage kits. This setting will harvest the matches you most need to solve first. The good news is that most researchers don’t need to collect every match for every kit to get good results. The skills you build working with the largest matches first are the foundation you will need to work with the trickier smaller segments. Starting with your closest matches (the people who share the most DNA) is the approach endorsed by experts.

If you are more experienced, you already know that solving the big segments first is a must. With DNAGedcom you can always go back and collect the rest later if needed, so you can safely get down to work with the results of your large-segment gathers. You can even run additional gathers in the background while you run the DNAGedcom cluster programs on the first batch of data.

By the way, once you get a sense of how long a kit of a certain size is likely to take, you will also know which kits can be gathered in a single session.

BUT: If Tip 1 doesn’t appeal to you for any reason, read on. Because of the time required to gather smaller segments, knowing more about how DGC works and having a “gather plan” tailored to your research goals will save time and frustration in the long run. This article aims to help you weigh your options by explaining why gathers can be astonishingly slow and how DNAGedcom works. Broadly, we cover:

  1. What you should know about how DNAGedcom’s “gather” tools work in order to use them efficiently
  2. How much data, and which types, you really need
  3. How to estimate how long any gather will take
  4. How to make an incremental gather plan tailored to your kit(s) and research goals

The Gather Challenge

Tip 2: Understanding why some gathers can be completed in five minutes while others take hours, days, or weeks is important if you want to minimize frustration and plan a foolproof gather.

First, the more kits in a vendor’s database, the more matches you are likely to find for your kit, allowing for factors such as differences in family sizes over time, how well your ethnic group is represented in a particular DNA database (or in any of them), and so on. Because Ancestry, FTDNA, and MyHeritage have sold the most kits, those are the kits likely to take the longest to gather if you have kits on all five of DNAGedcom’s supported platforms.

Next, recall that the number of ancestors doubles with each generation back (2, 4, 8, 16, ...). Roughly speaking, the number of matches grows in the same way as relationships become more distant. In practice, the smaller the total amount of DNA shared, the more matches there will be and the longer it will take to gather the data.
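The number of direct ancestors doubles with each generation back. The short Python sketch below only illustrates that doubling; the function name is made up for this example, and real match counts depend on many other factors such as family size and database coverage.

```python
def ancestors_in_generation(generation: int) -> int:
    """Direct ancestors in one generation (1 = parents, 2 = grandparents, ...)."""
    if generation < 1:
        raise ValueError("generation must be 1 or greater")
    return 2 ** generation


# Print the count for the first six generations back.
for generation in range(1, 7):
    print(generation, ancestors_in_generation(generation))
```

The further back the shared ancestor, the more ancestral lines there are through which a distant cousin can match, which is one reason small matches vastly outnumber large ones.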

Here is an example: Gathering matches, ICW, and trees for an Ancestry kit with 40,000 matches took nearly 20 hours. However, gathering only matches, ICW, and trees above 30 cM for the same kit took less than 5 minutes! The reason for the difference is that more than 90% of the matches were below 30 cM.

Unfortunately, vendors sometimes impose restrictions on 3rd party software accessing their matches. MyHeritage compounds the challenge by slowing every data gather, large or small, with an additional wait time of more than a minute between matches. Even if you only want to download full data for 10,000 MyHeritage matches, consider this: if each match takes a minute to collect, plus a wait of a minute or more before the next match, gathering those 10,000 matches would take roughly 14 to 18 uninterrupted days, depending on how far the wait exceeds one minute!
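The MyHeritage arithmetic can be checked with a few lines of Python. This is a back-of-the-envelope sketch, not a DGC feature; the function name and the per-match timings are assumptions taken from the example in this section.

```python
def gather_days(matches: int, collect_min: float = 1.0, wait_min: float = 1.0) -> float:
    """Days of uninterrupted gathering, assuming a fixed time per match."""
    total_minutes = matches * (collect_min + wait_min)
    return total_minutes / (24 * 60)


# One minute to collect plus a one-minute wait per match.
print(f"{gather_days(10_000):.1f} days")

# Same collection time, but an assumed 1.6-minute wait per match.
print(f"{gather_days(10_000, wait_min=1.6):.1f} days")
```

With a one-minute wait the total is about 14 days; a wait of about 1.6 minutes pushes it to about 18 days. Either way, a full MyHeritage gather is measured in weeks, not hours.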

Still, the “collect it all now” strategy can be tempting to users who have the stamina to launch, and often restart, an extended gather. But understanding how DGC works is essential, since a gather that runs for more than a day can easily collide with operating system updates, power failures, and user errors of all sorts. Unless you know how to restart a gather without losing data, avoid this approach. It is not necessary.

Here is a viable alternative: use an incremental gather plan to gather the closest matches first. While you follow the leads in the first gathered matches, you can reset the lower cM limit to keep gathering data for matches who share less DNA. Because a subsequent gather with the same upper limit starts where the previous one left off, the new gather will pick up any new matches above the previous lower limit and then proceed to pick up matches in the new lower range.

An incremental approach has advantages. It lets you work on matches who are more closely related to your kit even while you extend the lower limit of your gather to matches who share less DNA. Thus, rather than spending a week gathering a full Ancestry dataset before you begin to investigate the relationships among matches, you can run the Autosomal Tools (Collins Leeds Method and Chromosome Matrix Analysis) on the data you have gathered, then repeat the cluster tests as new matches are gathered. Because smaller matches are often more easily interpreted in the context of the larger matches you have already solved, this approach mitigates the problem of extended gather times while focusing on matches whose most recent common ancestors (MRCAs) are closer in time and whose segment sizes are the least equivocal.

What Data Do You Actually Need to Gather?

Tip 3: Consider how much data you will need to solve your particular research questions.

For example, a person solving a 6-generation tree is likely to make good progress with matches of 30 cM or larger, but someone finding good trees at 20 cM might want to expand the range. A person solving an adoption within 1 or 2 generations only needs matches in the higher total cM ranges associated with the generations of siblings, aunts/uncles, grandparents, and great-grandparents; 30 cM is again plenty. In contrast, a person systematically seeking to push a tree back as many generations as possible will eventually find a few valuable, provable matches with good trees in the fringes (10 to 6 cM), if they are willing to do the tedious work of gathering thousands of essentially unusable matches at the same time.

For example, if you are trying to discover from DNA who your maternal grandmother was, collecting matches who share only 10 cM with you will generally be less productive than focusing on matches who share considerably more DNA. The matches whose most recent common ancestors (MRCAs) fall closest to your study timeframe would be 1st to 4th cousins, aunts/uncles, and great-grandparents, predicted to share between x and x cM.

In contrast, if your goal is to map and paint each segment of shared DNA to its ancestral source, you will wind up working with small segments. This is where planning incremental gathers can be a lifesaver.

Tip 4: Consider the data requirements of the genetic genealogy software tools you would like to use.

If you follow advances in software tools designed to help untangle and identify segments of DNA, you will know that match data are becoming increasingly accessible to automated analysis. Matrix and network-node cluster analysis, chromosome painting by ancestral line, and surname analysis, for example, are making smaller matches easier to explore. Each application has specific requirements for which types of data, and how much, you will need to gather.

DNAGedcom and Genetic.Family analysis tools are rapidly expanding. Are the matches you most want to study with DNAGedcom’s clustering tools going to be picked up in your gather? Are you going to use the Genetic.Family data bridge? If so, you will want to be sure to gather every bit of tree data you can. If you are going to use DNAPainter’s new Cluster Auto Painter feature, which uses DNAGedcom clusters, make sure the matches whose segments you want to see painted have been gathered by DNAGedcom before you run the Collins Leeds Method (CLM) option.

DNAGedcom supports a number of valuable third-party products such as GenomeMate Pro and Rootsfinder DNA Tools. Each has specific file type requirements to do its magic. Knowing in advance what is needed can save gathering time in the long run.

Tip 5: Decide how much risk you can (or should) tolerate.

The smaller the segment, the higher the probability of mistaking a false positive or a pileup-type match with a fascinating tree for a valuable clue. In the first case, the segment may match neither parent (Estes, 2020) and is of no use. In the second case, the segment may be shared by so many people that it is associated with an endogamous group rather than specific surnames, or it may simply be an uninterpretable pileup. To complicate matters further, the smaller the segment, the more likely you’ll run into differences between chipset versions that sample different chromosome regions.

Still, there is the risk of missing a valuable clue amidst the huge collection of small segments. If you are familiar with critical concepts in genetic genealogy and are experienced enough to be confident in your methods for working with smaller segments, there will be a few provably valid matches with useful ancestral information to be had between 15 cM and 6 cM. But even within this range, the larger the segment, the safer you are.

These tips raise one more question: How much do you know about genetic genealogy? This may seem harsh, but unless you commit to learning the fundamental concepts and methods required to work with both DNA and genealogical data, you will get little value from the time it takes to collect data you can’t reliably use.

You will find that understanding and proving both the genetic and genealogical connections among your matches at larger cM ranges is plenty challenging. From this perspective, the temptation to work with the smallest segments should take a lower priority.

Tactics or How Should You Gather?

The fundamental question will always be how long it will take to gather the matches in the cM range required for your particular research questions.

The main choices are:

  1. Gather all possible data in one gather, no matter how long it takes, followed by periodic updates if more data are needed.
  2. Gather incrementally by type of data (ICW, segment, tree).
  3. Gather incrementally by cM ranges (collecting all types simultaneously, but in stages by progressively diminishing cM ranges).

Choice 1: If you have a reasonable expectation of completing a full gather in an amount of time you can tolerate, why not? This is not as bad as it sounds, because if you stop a gather and start the same gather later, it will pick up where it left off.

Choice 2: Gathering by file type diminishes the usefulness of the data, since most uses require that all types be available. However, if you have no use for a specific type of data, not gathering it saves time.

Choice 3: Incremental gathering by range is arguably the best approach, especially if you intend to gather all matches for a kit with more than 6000 matches, because you (or your computer) will need to take breaks in the gather process (e.g. system updates are likely to interrupt a gather, or you need to shut down before a gather is completed). Doing gathers in stages requires some special attention and planning to avoid gaps in your database.

Review How DNAGedcom’s Gather Works

Tip 6: The better you understand how DNAGedcom works, the more successful incremental gathering will be for you. Read on. The rest of this article is for you.

Keep these facts in mind as you plan your gather strategies:

  • The first gather of a comprehensive dataset (matches, ICWs, segments, and trees) within a given range always takes the most time. Subsequent gathers in the same range for updating are much faster.
  • Gathering works at different speeds on the various platforms. It is best done on a high-speed internet connection; computer factors such as processor speed and RAM are relatively unimportant.
  • The gather program works match by match. That is, all requested data are collected for the first match, then for the second match, and so on.
  • When you restart a gather, the application picks up where it left off.
  • ICW and tree data are the most time consuming.
  • Remember that the number of matches, and the time it takes to gather them, increases geometrically as segment size decreases. This means a gather for a kit of 40,000 matches that takes less than 10 minutes for large segments (>30 cM) could take as much as 8 hours in full. (Remember that MyHeritage adds a minute per match on top of the minute or so required to collect each match.)
  • Don’t collect ethnicity data unless you have a defined purpose. For example, if your ethnicity reveals you are 30% Native American and you haven’t a clue where it’s coming from, the ethnicity of matches is relevant. For most users of mixed European descent, it is likely irrelevant.
  • Save time by collecting several kits on separate platforms at the same time. (You can also run CLM or CMA at the same time on an already gathered kit.)
  • Other vendor-related considerations
    • The databases at 23andme, FTDNA, and Gedmatch are smaller than those at Ancestry and MyHeritage, so they are almost always collectable in hours instead of days.

Sample Plan

Tip 7: Do a test run by gathering the first batch and observing the Time remaining counter as you move into the range you will need for the second batch.

To get the most accurate estimates using the status counter, set the cM size range and data sources to your targets for a test run. The longer it runs, the more accurate the estimated time to completion will be. The running percent completion is the percentage of total matches, not of selected matches. Observe the time remaining estimate when you pass the 30 cM level. Observe the total completed and decide if you have enough data to work with. If so, hit cancel to save what has been collected, noting the cM value so you can restart a second gather just above that cM.

Tip 8: If you have multiple kits to gather on the same platform, keep notes on how long each kit took to reach several cM benchmarks, e.g. 30, 20, 15, and 8 cM. This will help you better estimate the gather time for the next kit.
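One way to put benchmark notes to use is to turn them into a rough minutes-per-match rate and apply it to the next kit. The Python sketch below assumes gather time scales linearly with the number of matches, which is only an approximation; the kit names, counts, and timings are invented for illustration, and none of the names come from DGC itself.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkNote:
    kit: str         # label for the kit (illustrative)
    min_cm: float    # lower cM limit reached when the note was taken
    matches: int     # matches gathered down to that limit
    minutes: float   # elapsed gather time in minutes


def minutes_per_match(notes, min_cm):
    """Average minutes per match across all notes taken at one benchmark."""
    rows = [n for n in notes if n.min_cm == min_cm]
    total_matches = sum(n.matches for n in rows)
    if total_matches == 0:
        raise ValueError(f"no notes recorded at {min_cm} cM")
    return sum(n.minutes for n in rows) / total_matches


def estimate_minutes(notes, min_cm, expected_matches):
    """Rough estimate for a new kit, assuming time scales with match count."""
    return minutes_per_match(notes, min_cm) * expected_matches


# Invented notes from two earlier kits on the same platform.
notes = [
    BenchmarkNote("Kit A", 30, 400, 5.0),
    BenchmarkNote("Kit B", 30, 600, 7.0),
]
print(round(estimate_minutes(notes, 30, 2000), 1), "minutes")
```

Keeping one set of notes per platform matters, because gather speed differs from vendor to vendor.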

Example Plan

Here is an example of a gather plan for a person with a set of 5 original and/or kits on all five platforms. The plan starts with the closest matches on the slow-gather platforms and sets the stage to eventually explore all data, with the hope of correlating some matches across platforms where segment and/or triangulation data are available.

Background: In this case, the goal is to detect and eventually assign as many valid segments to MRCAs as possible. The hope is to isolate the segments needed to solve three dead ends in the 4th to 6th generations. This user keeps a phased master spreadsheet in Excel, sorted by chromosome and segment. The task is to weed out false positives and pileups when possible and to associate matches with ancestral lines and MRCAs where possible.

The person studying these data is at an intermediate skill level with respect to segment analysis. They use a variety of third-party and homegrown tools to track matches and paint segments, do both network-node and matrix cluster analysis as well as Gedmatch one-to-one testing, and always look to triangulate where possible. Note that for 23andme and Gedmatch kits, there are few enough matches that collecting the full dataset is more efficient than incremental gathering. The emphasis there is on periodic updates to screen for new matches, using the DNAGedcom Data tool to automate re-gathers.

Ancestry: total matches 97777
Estimated full gather time, all files:

| Relatives by generation | Est. lower limit (cM) | 95% CI | Total matches in range | Gather plan |
| CLOSE | | | | 1st collection |
| 2-3 | 75 | | 62 | 1st collection |
| 4-6 | 20 | | 2610 | 1st collection |
| DISTANT | -15 | | 7165 | |
| | 8 | | | 2nd collection |

MyHeritage: total matches
Estimated full gather time, all files:

| Relatives by generation | Est. lower limit (cM) | 95% CI | Total matches in range | Gather plan |
| CLOSE | | | | 1st collection |
| 2-3 | | | | 1st collection |
| 4-6 | | | | 1st collection |
| DISTANT | | | | 2nd collection |

FTDNA: total matches 6889
Estimated full gather time, all files:

| Relatives by generation | Est. lower limit (cM) | 95% CI | Total matches in range | Gather plan |
| CLOSE | | | | 1st collection |
| 2-3 | | | | 1st collection |
| 4-6 | | | | 1st collection |
| DISTANT | | | | 2nd collection |

23andme: total matches
Estimated full gather time, all files:
Plan: single gather, update periodically

Gedmatch: total matches (based on Tier 1 one-to-many test)
Estimated full gather time, all files:
Plan: single gather, update periodically

How to Make an Incremental Plan

The sample plan was made using a  systematic approach that considered each of these steps:

  1. Define your research goals and your plan of approach.
    1. Are you a beginner learning how to analyze match data, hoping to discover or extend your family tree? Start with your closest matches, at not less than 30 cM.
    2. Are you experienced in using genetic genealogy methods, with specific research goals that can suggest the scope of data you will need to get started? Then define the scope of matches that, in theory, best fits your questions.
    3. Do you know which analysis tools and methods are best suited to your research goals?
  2. Consider what types of CSV files are needed for the software you plan to use.
    1. Which DNAGedcom and/or Genetic.Family analysis tools will you be using? New options at DG and GF leverage tagging, ancestral surname searches, and the creation of cross-platform (merged) “superkits.”
    2. If you are using 3rd party tools that require gathered data, what types of data (segment? ICW? trees?) are needed for those tools (e.g. GenomeMate Pro, DNAPainter, RootsFinderDNA Tools, etc.)?
  3. Examine a chart of predicted relationships based on cM (examples) to tailor your gather to your immediate and long-range research goals. How many generations of matches does your project need? Make a list with predicted upper and lower limits to gather qualifying matches.
  4. Preview the gather time for matches within your target range by running a test gather, to establish how long it will take to gather that range. (See special instructions for MyHeritage.)
  5. If the amount of time to gather exceeds your available time, divide matches into gather groups by cM ranges, based on the distribution of matches within generational cM ranges. Your first priority will be to decide the smallest cM that serves your research purpose.
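Dividing a long gather into stages by cM range can be sketched in Python. The example assumes you already have cumulative match counts at a few benchmarks and a rough per-match gather time from a test run; all numbers and names here are illustrative, not measured values or DGC settings.

```python
def stage_plan(cumulative_counts, seconds_per_match):
    """Split a gather into stages by cM range.

    cumulative_counts: (lower_cm, matches at or above lower_cm) pairs,
    ordered from the highest cM benchmark to the lowest.
    Returns (lower_cm, new matches in the stage, estimated minutes) tuples.
    """
    plan = []
    gathered_so_far = 0
    for lower_cm, cumulative in cumulative_counts:
        new_matches = cumulative - gathered_so_far
        plan.append((lower_cm, new_matches, new_matches * seconds_per_match / 60))
        gathered_so_far = cumulative
    return plan


# Illustrative benchmarks for one kit, with an assumed 2 seconds per match.
benchmarks = [(30, 900), (20, 3000), (15, 7200)]
for lower_cm, new_matches, minutes in stage_plan(benchmarks, seconds_per_match=2):
    print(f"down to {lower_cm} cM: {new_matches} new matches, about {minutes:.0f} min")
```

Each stage only counts the matches not covered by the stage before it, so you can compare the estimated minutes per stage against the time you have available for a single session.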

Two ways to figure out the distribution within a kit of DNAGedcom gathered matches

  1. Launch a gather with the default minimum and maximum cM settings, watch the cM counter closely as it rolls by, and note how many matches have loaded when the counter hits each of your benchmarks, e.g. 100, 50, 20, 15, 8 cM. Divide each of those numbers by the total number of matches to calculate an approximate distribution of your matches across cM levels.
  2. If you have a very large kit and don’t want to spend an hour watching the counter, use the match file output to count the records up to each break point:
    1. If you are an advanced Excel user or are familiar with database queries, open the CSV match file and use a counting function or query to make an inventory of how many matches fall within each cM range of interest.
    2. If you are not familiar with spreadsheets, use Google Sheets to open the CSV file and start scrolling down, noting the record number (on the left) where the breaks fall, to get an approximate idea of how the matches in your kit are distributed.
  3. For MyHeritage, download the vendor match list for a kit and determine the number of matches in each range from that file before launching any gather. Divide the total matches at each break point by the total number of matches and you’ll rapidly see the time savings to be had by collecting MH in stages.
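If you are comfortable with a little scripting, the same inventory can be taken from a match CSV file with Python’s standard csv module. This is a minimal sketch: the column name `shared_cm` and the sample rows are hypothetical, so check the header row of your own file and substitute the real column name.

```python
import csv
import io


def cumulative_counts(csv_file, cm_column, benchmarks):
    """Count matches at or above each cM benchmark in an open CSV file."""
    values = []
    for row in csv.DictReader(csv_file):
        try:
            values.append(float(row[cm_column]))
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric cM value
    counts = {b: sum(1 for v in values if v >= b) for b in benchmarks}
    return len(values), counts


# Tiny made-up sample standing in for a real match file.
sample = io.StringIO(
    "name,shared_cm\nA,250.5\nB,48\nC,31.2\nD,19.9\nE,12\nF,8.4\n"
)
total, counts = cumulative_counts(sample, "shared_cm", [100, 50, 30, 20, 15, 8])
for benchmark, count in counts.items():
    print(f">= {benchmark} cM: {count} matches ({count / total:.0%} of {total})")
```

To run it on a real file, open the file with `open(path, newline="", encoding="utf-8")` and pass the file object in place of `sample`.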

