I have the following data:
data = {'f_geno': ["AA", "AA", "AA", "BB", "BB", "BB", "AB", "AB", "AB"],
        'ch_geno': ["AA", "AB", "BB", "AA", "AB", "BB", "AA", "BB", "AB"],
        'freq_A': [0.50, 0.46, 0.49, 0.57, 0.55, 0.44, 0.37, 0.66, 0.46],
        'freq_B': [0.50, 0.54, 0.51, 0.43, 0.45, 0.56, 0.63, 0.34, 0.54]
        }
I have already written a simple calculator in AWK that computes a value for each row and writes it to $5:
awk 'BEGIN {
    FS = OFS = ","
}
{
    if ($1 == "AA" && $2 == "AA") {
        $5 = (1 / $3)
    } else if ($1 == "AA" && $2 == "AB") {
        $5 = (0.5 / $3)
    } else if ($1 == "AA" && $2 == "BB") {
        $5 = (0.001)
    } else if ($1 == "BB" && $2 == "AA") {
        $5 = (0.001)
    } else if ($1 == "BB" && $2 == "AB") {
        $5 = (0.5 / $4)
    } else if ($1 == "BB" && $2 == "BB") {
        $5 = (1 / $4)
    } else if ($1 == "AB" && $2 == "AA") {
        $5 = (0.5 / $3)
    } else if ($1 == "AB" && $2 == "BB") {
        $5 = (0.5 / $4)
    } else {
        $5 = (($3 + $4) / (4 * $3 * $4))
    }
    print
}'
I would like to do the same thing as above, but in Python. Could anyone help, please?
You can use .apply() with a function:
import pandas as pd

def condition(x) -> float:
    # Each branch mirrors one or two branches of the AWK script.
    if x.f_geno == "AA" and x.ch_geno == "AA":
        return 1 / x.freq_A
    if (x.f_geno == "AA" and x.ch_geno == "AB") or (x.f_geno == "AB" and x.ch_geno == "AA"):
        return 0.5 / x.freq_A
    if (x.f_geno == "AA" and x.ch_geno == "BB") or (x.f_geno == "BB" and x.ch_geno == "AA"):
        return 0.001
    # AB/BB belongs here (0.5 / freq_B in the AWK script), not with BB/BB.
    if (x.f_geno == "BB" and x.ch_geno == "AB") or (x.f_geno == "AB" and x.ch_geno == "BB"):
        return 0.5 / x.freq_B
    if x.f_geno == "BB" and x.ch_geno == "BB":
        return 1 / x.freq_B
    # Remaining case: AB/AB.
    return (x.freq_A + x.freq_B) / (4 * x.freq_A * x.freq_B)

df = pd.DataFrame(data=data)
df["result"] = df.apply(condition, axis=1)
print(df)
Output:
  f_geno ch_geno  freq_A  freq_B    result
0     AA      AA    0.50    0.50  2.000000
1     AA      AB    0.46    0.54  1.086957
2     AA      BB    0.49    0.51  0.001000
3     BB      AA    0.57    0.43  0.001000
4     BB      AB    0.55    0.45  1.111111
5     BB      BB    0.44    0.56  1.785714
6     AB      AA    0.37    0.63  1.351351
7     AB      BB    0.66    0.34  1.470588
8     AB      AB    0.46    0.54  1.006441
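A faster, vectorized option is np.select:
import numpy as np
import pandas as pd

df = pd.DataFrame(data=data)
# Boolean masks for each father (f_) and child (ch_) genotype.
faa = df.f_geno == "AA"
chaa = df.ch_geno == "AA"
fab = df.f_geno == "AB"
chab = df.ch_geno == "AB"
fbb = df.f_geno == "BB"
chbb = df.ch_geno == "BB"
# masks[i] selects the rows that receive vals[i].
masks = [(faa & chaa),
         (faa & chab) | (fab & chaa),
         (faa & chbb) | (fbb & chaa),
         (fbb & chbb),
         (fbb & chab) | (fab & chbb)]
vals = [1 / df.freq_A,
        0.5 / df.freq_A,
        0.001,
        1 / df.freq_B,
        0.5 / df.freq_B]
# Rows matching no mask (AB/AB) fall through to the default.
default = (df.freq_A + df.freq_B) / (4 * df.freq_A * df.freq_B)
df["result"] = np.select(masks, vals, default=default)
print(df)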
Output:
  f_geno ch_geno  freq_A  freq_B    result
0     AA      AA    0.50    0.50  2.000000
1     AA      AB    0.46    0.54  1.086957
2     AA      BB    0.49    0.51  0.001000
3     BB      AA    0.57    0.43  0.001000
4     BB      AB    0.55    0.45  1.111111
5     BB      BB    0.44    0.56  1.785714
6     AB      AA    0.37    0.63  1.351351
7     AB      BB    0.66    0.34  1.470588
8     AB      AB    0.46    0.54  1.006441
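Because the branch table is easy to get wrong when pairs of genotypes are merged into a single condition, it may be worth checking the np.select result against a literal, row-by-row port of the AWK rules. The following is only a minimal sketch: awk_rule and vectorized are hypothetical helper names introduced here for illustration, not part of the original answers.

```python
import numpy as np
import pandas as pd

data = {
    "f_geno": ["AA", "AA", "AA", "BB", "BB", "BB", "AB", "AB", "AB"],
    "ch_geno": ["AA", "AB", "BB", "AA", "AB", "BB", "AA", "BB", "AB"],
    "freq_A": [0.50, 0.46, 0.49, 0.57, 0.55, 0.44, 0.37, 0.66, 0.46],
    "freq_B": [0.50, 0.54, 0.51, 0.43, 0.45, 0.56, 0.63, 0.34, 0.54],
}


def awk_rule(f, ch, a, b):
    """Hypothetical helper: one-to-one port of the AWK if/else chain."""
    if f == "AA" and ch == "AA":
        return 1 / a
    if f == "AA" and ch == "AB":
        return 0.5 / a
    if f == "AA" and ch == "BB":
        return 0.001
    if f == "BB" and ch == "AA":
        return 0.001
    if f == "BB" and ch == "AB":
        return 0.5 / b
    if f == "BB" and ch == "BB":
        return 1 / b
    if f == "AB" and ch == "AA":
        return 0.5 / a
    if f == "AB" and ch == "BB":
        return 0.5 / b
    return (a + b) / (4 * a * b)  # AB/AB


def vectorized(df):
    """Hypothetical helper: the np.select approach wrapped in a function."""
    faa, fab, fbb = (df.f_geno == g for g in ("AA", "AB", "BB"))
    chaa, chab, chbb = (df.ch_geno == g for g in ("AA", "AB", "BB"))
    masks = [
        faa & chaa,
        (faa & chab) | (fab & chaa),
        (faa & chbb) | (fbb & chaa),
        fbb & chbb,
        (fbb & chab) | (fab & chbb),
    ]
    vals = [1 / df.freq_A, 0.5 / df.freq_A, 0.001, 1 / df.freq_B, 0.5 / df.freq_B]
    default = (df.freq_A + df.freq_B) / (4 * df.freq_A * df.freq_B)
    return np.select(masks, vals, default=default)


df = pd.DataFrame(data)
# Each tuple is (f_geno, ch_geno, freq_A, freq_B), matching awk_rule's arguments.
expected = [awk_rule(*row) for row in df.itertuples(index=False, name=None)]
print(np.allclose(vectorized(df), expected))  # prints True
```

Wrapping the masks in a function also means they are rebuilt from whatever DataFrame is passed in, which matters as soon as the frame is enlarged or filtered.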
Performance with 90k rows:
# 90k rows
# masks, vals and default must be rebuilt from this larger df before timing np.select
df = pd.DataFrame(data=data)
df = pd.concat([df] * 10000, ignore_index=True)
In [98]: %timeit df["result"] = df.apply(condition, axis=1)
5.96 s ± 585 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [99]: %timeit df["result"] = np.select(masks, vals, default=default)
1.59 ms ± 20.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)