Tips and Tactics for Gathering DNA Data
Overview
Many users set out to gather everything possible on their first try only to discover a full gather for FTDNA, Ancestry, or MyHeritage may take hours, days or even weeks. Preplanning how much and which DNA match data to gather with the DNAGedcom Client (DGC) can save time and help your research keep moving even while gathering more data. Luckily, subsequent DGC gathers take a lot less time than the first gather, but getting past that first gather is the topic of this article. Gedmatch and 23andme gathers collect all matches automatically, but because of their smaller databases, complete gathers are generally pretty speedy.
TIP 1: If you are new to genetic genealogy or just plain impatient, set the cM range minimum to 30 cM when you collect Ancestry, FTDNA or MyHeritage kits. This setting will harvest the matches you most need to solve first. The good news is most researchers don’t need to collect every match for every kit to get good results. You will find the skills you build working with the largest matches first are the foundation you will need to work with the trickier smaller segments. Starting with your closest matches (the folks who share the most DNA) is the approach endorsed by experts.
If you are more experienced, you already know that solving the big segments first is a must. The good news is that with DNAGedcom you can always go back and collect the rest later if needed, so you can safely get down to work with the results of your large-segment gathers. You can even run additional gathers in the background while you run the DNAGedcom cluster programs on the first batch of data.
By the way, once you get a sense of how long a kit of a certain size is likely to take, you will also know which kits can be gathered in a single session.
BUT: If Tip 1 doesn’t appeal to you for any reason, read on. Because of the time required to gather smaller segments, knowing more about how DGC works and having a “gather plan” tailored to your research goals will save time and frustration in the long run. This article aims to help you consider what your options are based on understanding why gathers can be astonishingly slow and how DNAGedcom works. Broadly, we cover:
- What you should know about how DNAGedcom’s “gather” tools work in order to use them efficiently
- How much and which types of data you really need
- How to estimate how long any gather will take
- How to make an incremental gather plan tailored to your kit(s) and research goals
The Gather Challenge
Tip 2: Understanding why some gathers can be completed in five minutes while others take hours, days or weeks is important if you want to minimize frustration and plan a foolproof gather.
First, the more kits in a vendor’s database, the more matches you are likely to find for your kit, allowing for factors such as differences in family sizes over time, belonging to an ethnic group that is well represented in a particular DNA database (or in any of them), etc. Because Ancestry, FTDNA, and MyHeritage have sold the most kits, those are the kits likely to take the longest to gather if you have kits on all five of DNAGedcom’s supported platforms.
Second, recall that the number of ancestors doubles with each generation back (2 parents, 4 grandparents, 8 great-grandparents, 16, ...). The number of matches grows in roughly the same way: the smaller the total amount of DNA shared, the more matches there will be and the longer it will take to gather the data.
Here is an example: Gathering matches, ICW and trees for an Ancestry kit with 40,000 matches took nearly 20 hours. However, gathering only matches, ICW and trees above 30 cM for the same kit took less than 5 minutes! The reason for the difference is that more than 90% of the matches were below 30 cM!
Unfortunately, vendors sometimes impose restrictions on third-party software accessing their matches. MyHeritage compounds the challenge by slowing every data gather, large or small, with an additional wait time of more than a minute between matches. Even if you want to download full data for only 10,000 MyHeritage matches, consider this: if each match takes a minute to collect, plus a minute of waiting before the next match, gathering those 10,000 matches would take about 14 uninterrupted days, and longer whenever the wait exceeds a minute!
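The arithmetic behind an estimate like this is easy to script. The sketch below is only an illustration: the function name and its default per-match times are assumptions made for this example, not values reported by DGC. With one minute to collect and one minute of waiting per match, 10,000 matches come to about 20,000 minutes, or roughly two weeks of continuous running.

```python
def estimate_gather_days(match_count, collect_minutes=1.0, wait_minutes=1.0):
    """Estimate continuous run time in days for a match-by-match gather.

    collect_minutes and wait_minutes are assumed per-match times;
    replace them with what you observe on your own connection.
    """
    total_minutes = match_count * (collect_minutes + wait_minutes)
    return total_minutes / (60 * 24)  # 1,440 minutes in a day


days = estimate_gather_days(10_000)
print(f"10,000 matches: about {days:.1f} days of uninterrupted gathering")
```

Raising wait_minutes above 1.0 shows how quickly an enforced delay between matches stretches the total.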
Still, the “collect it all now” strategy can be tempting to users who have the stamina to attend to launching, and often restarting, an extended gather. But understanding how DGC works is essential, since a gather that runs for more than a day can easily collide with operating system updates, power failures and user errors of all sorts. Unless you know how to restart a gather without losing data, avoid this approach. It is not necessary.
There is a viable alternative: use an incremental gather plan to gather the closest matches first. While you follow the leads in the first gathered matches, you can lower the minimum cM to keep gathering data for matches who share less DNA. Because a subsequent gather with the same upper limit starts where the previous one left off, the new gather will pick up any new matches above the previous lower limit and then proceed to pick up matches in the new lower range.
An incremental approach has advantages. It lets you work on matches who are more closely related to your kit even while you extend the lower limit of your gather to matches who share less DNA. Thus, rather than spending a week gathering a full Ancestry dataset before you begin to investigate the relationships among matches, you can run the Autosomal Tools (Collins Leeds Matrix Method and Chromosome Matrix Analysis) on the data you have gathered, then repeat the cluster tests as new matches are gathered. Because many smaller matches are more easily interpreted in the context of the larger matches you have already solved, this approach mitigates the problem of extended gather times while focusing on matches whose most recent common ancestors (MRCAs) are closer in time and whose segment sizes are the least equivocal.
What Data Do You Actually Need to Gather?
Tip 3: Consider how much data you will need to solve your particular research questions
For example, a person solving a 6-generation tree is likely to make good progress with matches of 30 cM or larger, but someone finding good trees at 20 cM might want to expand the range. A person solving an adoption within 1 or 2 generations only needs matches in the higher total cM ranges associated with the siblings, aunts/uncles, grandparents and great-grandparents’ generations; 30 cM again is plenty. In contrast, a person systematically seeking to push a tree back as many generations as possible will eventually find a few valuable, provable matches with good trees in the fringes (6-10 cM), if they are willing to do the tedious work of gathering thousands of essentially unusable matches at the same time.
For example, if you are trying to discover from DNA who your maternal grandmother was, collecting matches who share only 10 cM with you will generally be less productive than focusing on matches who share considerably more DNA. The matches of interest, whose most recent common ancestors (MRCAs) are closest to your study timeframe, would be 1st to 4th cousins, aunts/uncles, and great-grandparents, predicted to share between x and x cM.
In contrast, if your goal is to map and paint each segment of shared DNA to its ancestral source, you will wind up working with small segments. This is where planning incremental gathers can be a lifesaver.
TIP 4: Consider the data requirements of the genetic genealogy software tools you would like to use
If you follow advances in software tools designed to help untangle and identify segments of DNA, you will know that match data are becoming increasingly accessible to automated analysis. Matrix and network-node cluster analysis, chromosome painting by ancestral line, and surname analysis, for example, are making smaller matches easier to explore. Each application has specific requirements for which types and how much data you will need to gather.
DNAGedcom and Genetic.Family analysis tools are rapidly expanding. Are the matches you most want to study with DNAGedcom’s clustering tools going to be picked up in your gather? Are you going to use the Genetic.Family data bridge? If so, you will want to be sure to gather every bit of tree data you can. If you are going to use DNAPainter’s new Cluster Auto Painter feature, which uses DNAGedcom clusters, make sure the matches whose segments you want to see painted have been gathered by DNAGedcom before you run the Collins Leeds Method (CLM) option.
DNAGedcom supports a number of valuable third-party products such as GenomeMate Pro and Rootsfinder DNA Tools. Each has specific file type requirements to do its magic. Knowing in advance what is needed can save gathering time in the long run.
TIP 5: Decide how much risk you can (or should) tolerate
The smaller the segment, the higher the probability of mistaking a false positive or a pileup-type match with a fascinating tree for a valuable clue. In the first case, the segment may even match neither parent (Estes, 2020) and be of no use. In the second case, the segment may be shared by so many people that it is associated with an endogamous group rather than specific surnames, or it may just be an uninterpretable pileup. To complicate matters further, the smaller the segment, the more likely you’ll run into differences between chipset versions that sample different chromosome regions.
Still, there is the risk of missing a valuable clue amidst the huge collection of small segments. If you are familiar with critical concepts in genetic genealogy and are experienced enough to be confident in your methods for working with smaller segments, there will be a few provably valid matches with useful ancestral information to be had between 6 and 15 cM. But even within this range, the larger the segment, the safer you are.
These tips raise one more question: How much do you know about genetic genealogy? This may seem harsh, but unless you make a commitment to learning the fundamental concepts and methods required to work with both DNA and genealogical data, you will get little value from the time it will take to collect data you can’t reliably use.
You will find understanding and proving both the genetic and genealogical connections among your matches at larger cM ranges is plenty challenging. From this perspective, the temptation to work with the smallest segments should take a lower priority.
Tactics or How Should You Gather?
The fundamental question will always be how long it will take to gather how many matches total in the range of cMs required for your particular research questions.
The main choices are:
- gather all possible data in one gather no matter how long it takes, followed by periodic updates if more data are needed,
- gather incrementally by type of data (ICW, segment, tree), or
- gather incrementally by cM range (collecting all types simultaneously, but in stages by progressively lower cM ranges).
Choice 1: If you have a reasonable expectation of completing a full gather in an amount of time you can tolerate, why not? This is not as bad as it sounds, because if you stop a gather and start the same gather later, it will pick up where it left off.
Choice 2: Gathering by file type diminishes the usefulness of the data, since most uses require that all types be available. However, if you have no use for a specific type of data, not gathering it saves time, of course.
Choice 3: Incremental gathering by range is arguably the best approach, especially if you intend to gather all matches for a kit with more than 6000 matches, because you (or your computer) will need to take breaks in the gather process (e.g. system updates are likely to interrupt a gather, or you need to shut down before a gather is completed). Doing gathers in stages requires some special attention and planning to avoid gaps in your database.
Review How DNAGedcom’s Gather Works
Tip 6: The better you understand how DNAGedcom works, the more successful incremental gathering will be for you. Read on. The rest of this article is for you.
Keep these facts in mind as you plan your gather strategies:
- The first gather of a comprehensive dataset (matches, ICWs, segments, and trees) within a given range always takes the most time. Subsequent gathers in the same range for updating are much faster.
- Gathering works at different speeds on the various platforms. It is best done on a high-speed internet connection; computer factors such as processor speed and RAM are relatively unimportant.
- The gather program works match by match, that is all requested data are collected for the first match, then for the second match and so on.
- When you restart a gather, the application picks up where it left off.
- ICW and tree data are the most time consuming.
- Remember that the number of matches, and the time it takes to gather them, increases geometrically as segment size decreases. This means a gather for a kit of 40,000 matches that takes less than 10 minutes for large segments (>30 cM) could take as much as 8 hours for the full range. (Remember that MyHeritage adds a minute per match on top of the minute or so required to collect each match.)
- Don’t collect ethnicity data unless you have a defined purpose. For example, if your ethnicity reveals you are 30% Native American and you haven’t a clue where it’s coming from, the ethnicity of matches is relevant. For most users of mixed European descent, it is likely irrelevant.
- Save time by collecting several kits on separate platforms at the same time. (You can also run CLM or CMA at the same time on an already gathered kit.)
- Other vendor-related considerations:
- The databases at 23andme, FTDNA and Gedmatch are smaller than those at Ancestry and MyHeritage, so they are almost always collectable in hours instead of days.
Sample Plan
Tip 7: Do a test run by gathering the first batch and observing the Time remaining counter as you move into the range you will need for the second batch
To get the most accurate estimates using the status counter, set the cM size range to your target and data sources for a test run. The longer it runs, the more accurate the estimated time to completion will be. The running percent completion is a percentage of total matches, not of selected matches. Observe the time remaining estimate when you pass the 30 cM level. Observe the total completed and decide if you have enough data to work with. If so, hit cancel to save what has been collected, noting the cM so you can restart a second gather just above that cM.
Tip 8: If you have multiple kits to gather on the same platform, keep notes of how long each gather took and how many matches had been gathered at several cM benchmarks, e.g. 30, 20, 15, 8. This will help you better estimate the gather time for the next kit.
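Benchmark notes like these can be turned into a rough estimate for the next kit. The following is a minimal sketch: the note format, the function names and every number in it are hypothetical and chosen only to show the calculation. It works out minutes per match within each cM band from one finished gather, then applies those rates to the match counts expected for another kit.

```python
# Hypothetical notes from one finished gather:
# (cM benchmark reached, matches gathered so far, minutes elapsed so far)
notes = [(30, 600, 6.0), (20, 2600, 46.0), (15, 7600, 196.0), (8, 27600, 996.0)]


def band_rates(notes):
    """Return minutes per match for each band ending at a noted benchmark."""
    rates = {}
    prev_matches, prev_minutes = 0, 0.0
    for cm, matches, minutes in notes:
        band_matches = matches - prev_matches
        band_minutes = minutes - prev_minutes
        if band_matches > 0:
            rates[cm] = band_minutes / band_matches
        prev_matches, prev_minutes = matches, minutes
    return rates


def estimate_minutes(band_counts, rates):
    """band_counts maps a cM benchmark to the matches expected in that band."""
    return sum(count * rates[cm] for cm, count in band_counts.items())


rates = band_rates(notes)
# Assumed match counts for the next kit, down to 20 cM only
next_kit = {30: 1000, 20: 3000}
print(f"Estimated gather time: {estimate_minutes(next_kit, rates):.0f} minutes")
```

Keeping a separate rate per band matters because the time per match need not be the same in every cM range.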
Example Plan
Here is an example of a gather plan for a person with a set of 5 original and/or transferred kits on all five platforms. The plan starts with the closest matches on the slow-gather platforms and sets the stage to eventually explore all data, in the hope of correlating some matches across platforms where segment and/or triangulation data are available.
Background: In this case, the goal is to detect and eventually assign as many valid segments to MRCAs as possible. The hope is to isolate the segments needed to solve 3 dead ends in the 4th-6th generations. This user keeps a phased master spreadsheet in Excel, sorted by chromosome and segment. The task is to weed out false positives and pileups when possible and to associate the remaining matches with ancestral lines and MRCAs where possible.
The person studying these data is at an intermediate skill level with respect to segment analysis and uses a variety of third-party and homegrown tools to track matches and paint segments, does both network-node and matrix cluster analysis as well as Gedmatch one-to-one testing, and always looks to triangulate where possible. Note that for 23andme and Gedmatch kits, there are few enough matches that collecting the full dataset is more efficient than incremental gathering. The emphasis there is on periodic updates to screen for new matches, using the DNAGedcom Data tool to automate re-gathers.
Ancestry Total Matches 97777 | | | |
Estimated full gather time all files | | | |
Relatives by generation | Est. lower limit cM (95% CI) | Total matches in range | Gather plan |
CLOSE | | | 1st collection |
2-3 | 75 | 62 | 1st collection |
4-6 | 20 | 2610 | 1st collection |
DISTANT | -15 | 7165 | |
 | 8 | | 2nd collection |
MyHeritage Total Matches | | | |
Estimated full gather time all files | | | |
Relatives by generation | Est. lower limit cM (95% CI) | Total matches in range | Gather plan |
CLOSE | | | 1st collection |
2-3 | | | 1st collection |
4-6 | | | 1st collection |
DISTANT | | | 2nd collection |
FTDNA Total Matches 6889 | | | |
Estimated full gather time all files | | | |
Relatives by generation | Est. lower limit cM (95% CI) | Total matches in range | Gather plan |
CLOSE | | | 1st collection |
2-3 | | | 1st collection |
4-6 | | | 1st collection |
DISTANT | | | 2nd collection |
23andme Total Matches | | | |
Estimated full gather time all files | | | Plan: single gather, update periodically |
Gedmatch Total Matches (based on Tier 1 one-to-many test) | | | |
Estimated full gather time all files | | | Plan: single gather, update periodically |
How to Make an Incremental Plan
The sample plan was made using a systematic approach that considered each of these steps:
- Define your research goals and your plan of approach.
- Are you a beginner learning how to analyze match data, hoping to discover or extend your family tree? Start with your closest matches, not less than 30 cM.
- Are you experienced in using genetic genealogy methods, with specific research goals that can suggest the scope of data you will need to get started? Then define the scope of matches that theoretically best fits your questions.
- Do you know which analysis tools and methods are best suited to your research goals?
- Consider what types of CSV files are needed for the software you plan to use.
- Which DNAGedcom and/or Genetic.Family analysis tools will you be using? New options at DG and GF leverage tagging, ancestral surname searches, and the creation of cross-platform (merged) “superkits.”
- If you are using third-party tools that require gathered data, what types of data (segment? ICW? trees?) are needed for those tools (e.g. GenomeMate Pro, DNAPainter, RootsFinderDNA Tools, etc.)?
- Examine a chart of predicted relationships based on cM (examples) to tailor your gather to reflect your immediate and long-range research goals. How many generations of matches does your project need? Make a list with predicted upper and lower limits to gather qualifying matches.
- Preview the gather time for matches within your target range by running a test gather. (See special instructions for MyHeritage.)
- If the amount of time to gather exceeds your available time, divide matches into gather groups by cM ranges based on the distribution of matches within generational cM ranges. Your first priority will be deciding the smallest cM that serves your research purpose.
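The last step, dividing matches into gather groups, can be sketched in a few lines of code. This is only an illustration under stated assumptions: the band counts, the flat minutes-per-match rate and the time budget are all made-up inputs, and plan_stages is a hypothetical helper rather than part of any DNAGedcom tool. It walks the cM bands from highest to lowest and starts a new stage whenever adding the next band would exceed the time you have available.

```python
def plan_stages(bands, minutes_per_match, budget_minutes):
    """Group cM bands into gather stages that fit a time budget.

    bands: list of (lower_cm, match_count), ordered from highest cM to lowest.
    A band that is too big for the budget on its own still becomes a stage,
    so check the reported minutes before launching it.
    """
    stages = []
    current = None
    for lower_cm, count in bands:
        cost = count * minutes_per_match
        if current is not None and current["minutes"] + cost <= budget_minutes:
            # Extend the current stage down to this band's lower limit
            current["lower_cm"] = lower_cm
            current["matches"] += count
            current["minutes"] += cost
        else:
            current = {"lower_cm": lower_cm, "matches": count, "minutes": cost}
            stages.append(current)
    return stages


# Illustrative inputs only: band counts, rate and budget are assumptions
bands = [(75, 60), (20, 2600), (15, 7000)]
for number, stage in enumerate(plan_stages(bands, 0.03, 120), start=1):
    print(f"Stage {number}: gather down to {stage['lower_cm']} cM, "
          f"{stage['matches']} matches, about {stage['minutes']:.0f} minutes")
```

Each stage keeps the same upper limit and only lowers the minimum cM, which matches the way a restarted gather picks up where the previous one left off.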
Two ways to figure out the distribution within a kit of DNAGedcom gathered matches
- Launch a gather with the default minimum and maximum cM settings, closely watch the cM counter roll by, and note how many matches have loaded when the counter hits each of your benchmarks, e.g. 100, 50, 20, 15, 8 cM. Divide each of those numbers by the total number of matches to calculate an approximate distribution of your matches across cM levels.
- If you have a very large kit and don’t want to spend an hour watching the counter, use the match file output to observe the number of records up to each break point:
- If you are an advanced Excel user or are familiar with database queries, open the CSV match file and use a counting function or query to make an inventory of how many matches fall within each cM range of interest.
- If you are not familiar with spreadsheets, use Google Sheets to open the CSV file and scroll down, noting the record number (on the left) where the breaks fall, to get an approximate idea of how the distribution of matches in your kit looks.
- For MyHeritage, download the vendor match list for a kit and determine the number of matches in each range from that file before launching any gather. Divide the total matches at each break point by the total number of matches and you’ll rapidly see the time savings to be had by collecting MH in stages.
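For readers comfortable with a little scripting, the same inventory can be taken without a spreadsheet. The sketch below uses only Python's standard csv module. The column name sharedCM and the sample rows are assumptions for illustration; check the header row of your own match file and substitute the real column name.

```python
import csv
import io


def cumulative_counts(csv_file, cm_column, benchmarks):
    """Count matches at or above each cM benchmark and their share of the total."""
    values = [float(row[cm_column]) for row in csv.DictReader(csv_file)]
    total = len(values)
    result = {}
    for benchmark in sorted(benchmarks, reverse=True):
        n = sum(1 for v in values if v >= benchmark)
        result[benchmark] = (n, n / total if total else 0.0)
    return total, result


# Tiny made-up sample standing in for a real match file
sample = io.StringIO(
    "name,sharedCM\nA,250.5\nB,64\nC,31.2\nD,22\nE,16.4\nF,9.1\nG,8.3\nH,7.9\n"
)
total, counts = cumulative_counts(sample, "sharedCM", [100, 50, 20, 15, 8])
for benchmark, (n, share) in counts.items():
    print(f">= {benchmark} cM: {n} matches ({share:.0%} of {total})")
```

To run it against a real export, replace the sample with an opened file, for example `open("matches.csv", newline="", encoding="utf-8")`, where matches.csv is a placeholder file name.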